Node manager policy matrix

arvados node state (last_ping_at, crunch_worker_state):
  • (no arvados record associated with cloud node) -> unpaired
  • last_ping_at is stale -> down
  • slurm state is idle -> idle
  • slurm state is drng, alloc, maint -> busy
  • slurm state is drain, down, error, fail, unknown, any* -> down
billing window:
  • open
  • closed
boot_grace (time since boot):
  • under boot expiry -> boot wait
  • exceeded boot expiry -> boot exceeded
idle_grace (time since last state change to "idle"):
  • arvados node state is not 'idle' -> not idle
  • idle and not exceeded grace period -> idle wait
  • idle and exceeded grace period -> idle exceeded
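
The classification rules above can be sketched in Python roughly as follows. This is a minimal illustration, not node manager's actual code: the function names and the `ping_stale_after`, `boot_expiry`, and `idle_expiry` parameters are hypothetical, a missing `last_ping_at` is assumed to count as stale, and "any*" is read as any slurm state carrying a trailing `*`. The billing window is passed in as a given value because this document does not say how it is computed.

```python
from datetime import datetime, timedelta

# Slurm states that count as busy, per the list above.
BUSY_SLURM_STATES = {"drng", "alloc", "maint"}


def crunch_worker_state(paired, last_ping_at, slurm_state, now, ping_stale_after):
    """Return 'unpaired', 'down', 'idle', or 'busy' (hypothetical helper)."""
    if not paired:
        return "unpaired"
    # Assumption: a node that has never pinged is treated as stale.
    if last_ping_at is None or now - last_ping_at > ping_stale_after:
        return "down"
    if slurm_state == "idle":
        return "idle"
    if slurm_state in BUSY_SLURM_STATES:
        return "busy"
    # drain, down, error, fail, unknown, and starred states such as
    # "idle*" all fall through to down.
    return "down"


def boot_grace(boot_time, now, boot_expiry):
    """Return 'boot wait' while under the boot expiry, else 'boot exceeded'."""
    return "boot wait" if now - boot_time < boot_expiry else "boot exceeded"


def idle_grace(worker_state, idle_since, now, idle_expiry):
    """Return 'not idle', 'idle wait', or 'idle exceeded'."""
    if worker_state != "idle":
        return "not idle"
    return "idle wait" if now - idle_since < idle_expiry else "idle exceeded"


# Example: a paired node that pinged a minute ago, booted two hours ago,
# and has been idle for twenty minutes.
now = datetime(2020, 1, 1, 12, 0)
worker = crunch_worker_state(
    True, now - timedelta(minutes=1), "idle", now, timedelta(minutes=10))
state = (
    worker,
    "open",  # billing window, supplied by the caller in this sketch
    boot_grace(now - timedelta(hours=2), now, timedelta(minutes=30)),
    idle_grace(worker, now - timedelta(minutes=20), now, timedelta(minutes=10)),
)
print(state)
```

The order of the checks follows the order of the bullets: pairing first, then ping staleness, then slurm state, so a stale node is reported as down even if slurm still lists it as idle.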

Node manager constructs a state tuple (crunch_worker_state, window, boot_grace, idle_grace) and then consults the following table to determine what action to take. Actions are:

  • None (do nothing)
  • START_DRAIN (put the node into slurm draining state)
  • START_SHUTDOWN (initiate cloud shutdown)

crunch_worker_state = ['unpaired', 'busy', 'idle', 'down']
window = ['open', 'closed']
boot_grace = ['boot wait', 'boot exceeded']
idle_grace = ['not idle', 'idle wait', 'idle exceeded']

{('busy', 'closed', 'boot exceeded', 'idle exceeded'): None,
 ('busy', 'closed', 'boot exceeded', 'idle wait'): None,
 ('busy', 'closed', 'boot exceeded', 'not idle'): None,
 ('busy', 'closed', 'boot wait', 'idle exceeded'): None,
 ('busy', 'closed', 'boot wait', 'idle wait'): None,
 ('busy', 'closed', 'boot wait', 'not idle'): None,
 ('busy', 'open', 'boot exceeded', 'idle exceeded'): None,
 ('busy', 'open', 'boot exceeded', 'idle wait'): None,
 ('busy', 'open', 'boot exceeded', 'not idle'): None,
 ('busy', 'open', 'boot wait', 'idle exceeded'): None,
 ('busy', 'open', 'boot wait', 'idle wait'): None,
 ('busy', 'open', 'boot wait', 'not idle'): None,

 ('down', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('down', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'idle wait'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'not idle'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'idle wait'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'not idle'): "START_SHUTDOWN",

 ('idle', 'closed', 'boot exceeded', 'idle exceeded'): None,
 ('idle', 'closed', 'boot exceeded', 'idle wait'): None,
 ('idle', 'closed', 'boot exceeded', 'not idle'): None,
 ('idle', 'closed', 'boot wait', 'idle exceeded'): None,
 ('idle', 'closed', 'boot wait', 'idle wait'): None,
 ('idle', 'closed', 'boot wait', 'not idle'): None,
 ('idle', 'open', 'boot exceeded', 'idle exceeded'): "START_DRAIN",
 ('idle', 'open', 'boot exceeded', 'idle wait'): None,
 ('idle', 'open', 'boot exceeded', 'not idle'): None,
 ('idle', 'open', 'boot wait', 'idle exceeded'): "START_DRAIN",
 ('idle', 'open', 'boot wait', 'idle wait'): None,
 ('idle', 'open', 'boot wait', 'not idle'): None,

 ('unpaired', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot wait', 'idle exceeded'): None,
 ('unpaired', 'closed', 'boot wait', 'idle wait'): None,
 ('unpaired', 'closed', 'boot wait', 'not idle'): None,
 ('unpaired', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot wait', 'idle exceeded'): None,
 ('unpaired', 'open', 'boot wait', 'idle wait'): None,
 ('unpaired', 'open', 'boot wait', 'not idle'): None}
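
The 48 entries above follow a small number of rules, which the sketch below makes explicit by regenerating the table with `itertools.product`. The `action_for` function and the upper-case list names are hypothetical and used only for illustration; the document does not say whether node manager stores the table literally or derives it.

```python
from itertools import product

CRUNCH_WORKER_STATE = ["unpaired", "busy", "idle", "down"]
WINDOW = ["open", "closed"]
BOOT_GRACE = ["boot wait", "boot exceeded"]
IDLE_GRACE = ["not idle", "idle wait", "idle exceeded"]


def action_for(worker, window, boot, idle):
    """Return the action for one state tuple (hypothetical helper)."""
    if worker == "down":
        # Down nodes are shut down regardless of window or grace periods.
        return "START_SHUTDOWN"
    if worker == "unpaired":
        # Unpaired nodes get until the boot expiry to show up.
        return "START_SHUTDOWN" if boot == "boot exceeded" else None
    if worker == "idle":
        # Idle nodes are drained only in an open window, once the idle
        # grace period has been exceeded.
        if window == "open" and idle == "idle exceeded":
            return "START_DRAIN"
        return None
    # Busy nodes are always left alone.
    return None


# Rebuild the full policy table, keyed by state tuple.
policy = {
    state: action_for(*state)
    for state in product(CRUNCH_WORKER_STATE, WINDOW, BOOT_GRACE, IDLE_GRACE)
}

# Consulting the table is then a plain dictionary lookup.
print(policy[("idle", "open", "boot exceeded", "idle exceeded")])
```

A literal table like the one above is easier to audit line by line, while the rule form makes the intent of each row easier to see; comparing the two is a cheap consistency check.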

Note on libcloud node states:
  • error, unknown -> broken
  • everything else -> ok

However, we don't use the libcloud node state: it is expensive to fetch on some clouds and less useful than knowing whether the node is actually live and in communication. A blanket policy of shutting down nodes that are unavailable to do useful work should also catch broken nodes.