Node manager policy matrix » History » Version 1
Peter Amstutz, 04/14/2016 06:12 PM
h1. Node manager policy matrix

arvados node state (last_ping_at, crunch_worker_state):
* (no arvados record associated with cloud node) -> unpaired
* last_ping_at is stale -> down
* slurm state is idle -> idle
* slurm state is drng, alloc, maint -> busy
* slurm state is drain, down, error, fail, unknown, any* -> down

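The classification rules above can be sketched as a single function. This is only an illustration, not Node Manager's actual code: the function name, the dict-shaped @arvados_node@ argument, the @stale_after@ threshold, and the assumption that the rules are applied top to bottom (and that a missing @last_ping_at@ counts as stale) are all hypothetical.

```python
import time


def crunch_worker_state(arvados_node, slurm_state, now=None, stale_after=600):
    """Classify a cloud node as 'unpaired', 'down', 'idle', or 'busy'.

    arvados_node: dict with a 'last_ping_at' epoch timestamp, or None when
                  no Arvados record is paired with the cloud node.
    slurm_state:  compact SLURM state string such as 'idle', 'alloc', 'drng'.
    stale_after:  hypothetical staleness threshold in seconds (assumption).
    """
    now = time.time() if now is None else now

    # No Arvados record associated with the cloud node.
    if arvados_node is None:
        return 'unpaired'

    # Assumption: a record that has never pinged is treated as stale.
    last_ping = arvados_node.get('last_ping_at')
    if last_ping is None or now - last_ping > stale_after:
        return 'down'

    if slurm_state == 'idle':
        return 'idle'
    if slurm_state in ('drng', 'alloc', 'maint'):
        return 'busy'

    # drain, down, error, fail, unknown, and any state with a trailing '*'
    # (for example 'idle*') all fall through to 'down'.
    return 'down'
```

Passing @now@ explicitly keeps the function deterministic, which makes the staleness rule easy to exercise in tests.
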
billing window:
* open
* closed

boot_grace (time since boot):
* under boot expiry -> boot wait
* exceeded boot expiry -> boot exceeded

idle_grace (time since last state change to "idle"):
* arvados node state is not 'idle' -> not idle
* idle and not exceeded grace period -> idle wait
* idle and exceeded grace period -> idle exceeded

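A minimal sketch of the two grace-period components, assuming times are measured in seconds. The function signatures and the @boot_expiry@ and @grace_period@ parameters are hypothetical, and treating a time exactly equal to the limit as "exceeded" is an assumption, since the text does not define the boundary.

```python
def boot_grace(seconds_since_boot, boot_expiry):
    """Return 'boot wait' while under the boot expiry, else 'boot exceeded'."""
    if seconds_since_boot < boot_expiry:
        return 'boot wait'
    return 'boot exceeded'


def idle_grace(node_state, seconds_idle, grace_period):
    """Return 'not idle', 'idle wait', or 'idle exceeded'.

    node_state:   the arvados node state ('unpaired', 'busy', 'idle', 'down').
    seconds_idle: time since the last state change to 'idle'.
    """
    if node_state != 'idle':
        return 'not idle'
    if seconds_idle < grace_period:
        return 'idle wait'
    return 'idle exceeded'
```
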
Node manager constructs a state tuple (crunch_worker_state, window, boot_grace, idle_grace) and consults the following table to determine what action to take. Actions are:

* None (do nothing)
* START_DRAIN (put the node into slurm draining state)
* START_SHUTDOWN (initiate cloud shutdown)

<pre>
crunch_worker_state = ['unpaired', 'busy', 'idle', 'down']
window = ["open", "closed"]
boot_grace = ["boot wait", "boot exceeded"]
idle_grace = ["not idle", "idle wait", "idle exceeded"]

{('busy', 'closed', 'boot exceeded', 'idle exceeded'): None,
 ('busy', 'closed', 'boot exceeded', 'idle wait'): None,
 ('busy', 'closed', 'boot exceeded', 'not idle'): None,
 ('busy', 'closed', 'boot wait', 'idle exceeded'): None,
 ('busy', 'closed', 'boot wait', 'idle wait'): None,
 ('busy', 'closed', 'boot wait', 'not idle'): None,
 ('busy', 'open', 'boot exceeded', 'idle exceeded'): None,
 ('busy', 'open', 'boot exceeded', 'idle wait'): None,
 ('busy', 'open', 'boot exceeded', 'not idle'): None,
 ('busy', 'open', 'boot wait', 'idle exceeded'): None,
 ('busy', 'open', 'boot wait', 'idle wait'): None,
 ('busy', 'open', 'boot wait', 'not idle'): None,

 ('down', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('down', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'idle wait'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'not idle'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'idle wait'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'not idle'): "START_SHUTDOWN",

 ('idle', 'closed', 'boot exceeded', 'idle exceeded'): None,
 ('idle', 'closed', 'boot exceeded', 'idle wait'): None,
 ('idle', 'closed', 'boot exceeded', 'not idle'): None,
 ('idle', 'closed', 'boot wait', 'idle exceeded'): None,
 ('idle', 'closed', 'boot wait', 'idle wait'): None,
 ('idle', 'closed', 'boot wait', 'not idle'): None,
 ('idle', 'open', 'boot exceeded', 'idle exceeded'): "START_DRAIN",
 ('idle', 'open', 'boot exceeded', 'idle wait'): None,
 ('idle', 'open', 'boot exceeded', 'not idle'): None,
 ('idle', 'open', 'boot wait', 'idle exceeded'): "START_DRAIN",
 ('idle', 'open', 'boot wait', 'idle wait'): None,
 ('idle', 'open', 'boot wait', 'not idle'): None,

 ('unpaired', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot wait', 'idle exceeded'): None,
 ('unpaired', 'closed', 'boot wait', 'idle wait'): None,
 ('unpaired', 'closed', 'boot wait', 'not idle'): None,
 ('unpaired', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot wait', 'idle exceeded'): None,
 ('unpaired', 'open', 'boot wait', 'idle wait'): None,
 ('unpaired', 'open', 'boot wait', 'not idle'): None}
</pre>

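Read row by row, the 48 entries in the table reduce to three rules: a down node is always shut down, an unpaired node is shut down once its boot grace is exceeded, and an idle node is drained when the window is open and its idle grace is exceeded. The sketch below regenerates the full table from those rules. The names @action_for@ and @TABLE@ are hypothetical and this is one possible encoding, not necessarily how Node Manager stores the matrix.

```python
import itertools

CRUNCH_WORKER_STATE = ['unpaired', 'busy', 'idle', 'down']
WINDOW = ['open', 'closed']
BOOT_GRACE = ['boot wait', 'boot exceeded']
IDLE_GRACE = ['not idle', 'idle wait', 'idle exceeded']


def action_for(state, window, boot, idle):
    """Return None, 'START_DRAIN', or 'START_SHUTDOWN' for one state tuple."""
    if state == 'down':
        return 'START_SHUTDOWN'
    if state == 'unpaired':
        # Give a freshly booted node time to pair before giving up on it.
        return 'START_SHUTDOWN' if boot == 'boot exceeded' else None
    if state == 'idle' and window == 'open' and idle == 'idle exceeded':
        return 'START_DRAIN'
    # Busy nodes, and idle nodes that are still waiting, are left alone.
    return None


# Expand the rules into the exhaustive lookup table keyed by state tuple.
TABLE = {
    key: action_for(*key)
    for key in itertools.product(
        CRUNCH_WORKER_STATE, WINDOW, BOOT_GRACE, IDLE_GRACE)
}

# Example lookup for an idle node past its grace period in an open window.
print(TABLE[('idle', 'open', 'boot exceeded', 'idle exceeded')])
```

Keeping the exhaustive table alongside the rules has the advantage that every combination is spelled out and can be reviewed individually, at the cost of some repetition.
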
Note on libcloud node states:
* error, unknown -> broken
* everything else -> ok

However, we don't use the libcloud node state: it is expensive to fetch on some clouds and not as useful as knowing whether the node is actually live and in communication. A blanket policy that shuts down nodes that are unavailable to do useful work should also catch broken nodes.