Node manager policy matrix
arvados node state (last_ping_at, crunch_worker_state):
- (no arvados record associated with cloud node) -> unpaired
- last_ping_at is stale -> down
- slurm state is idle -> idle
- slurm state is drng, alloc, maint -> busy
- slurm state is drain, down, error, fail, unknown, any* -> down
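The classification above can be sketched as a small function. This is only an illustration under stated assumptions, not Node manager's actual code: the function name, the dict-shaped record, the epoch-seconds `last_ping_at`, and the `stale_after` threshold are all hypothetical, and "any*" is read here as "any slurm state carrying a trailing asterisk".

```python
import time

# Slurm states grouped exactly as in the list above.
BUSY_SLURM_STATES = {"drng", "alloc", "maint"}

def crunch_worker_state(arvados_record, slurm_state, now=None, stale_after=600):
    """Hypothetical helper mirroring the list above.

    arvados_record: dict with a 'last_ping_at' epoch timestamp, or None when
    no arvados record is associated with the cloud node (assumed shape).
    stale_after: assumed staleness threshold in seconds.
    """
    if arvados_record is None:
        return "unpaired"
    now = time.time() if now is None else now
    last_ping = arvados_record.get("last_ping_at")
    if last_ping is None or now - last_ping > stale_after:
        return "down"
    if slurm_state == "idle":
        return "idle"
    if slurm_state in BUSY_SLURM_STATES:
        return "busy"
    # drain, down, error, fail, unknown, and starred states such as "idle*"
    return "down"
```

The checks run in the same order as the list, so a stale ping wins over whatever slurm reports.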
window:
- open
- closed
boot_grace:
- under boot expiry -> boot wait
- exceeded boot expiry -> boot exceeded
idle_grace:
- arvados node state is not 'idle' -> not idle
- idle and not exceeded grace period -> idle wait
- idle and exceeded grace period -> idle exceeded
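The remaining three dimensions, and the tuple they form together with the arvados node state, might be computed along these lines. This is a minimal sketch: every function and parameter name is hypothetical, durations are assumed to be in seconds, and treating "exactly at the limit" as still waiting is an arbitrary choice. The labels follow the spelling used in the policy table.

```python
def window_state(window_open):
    # window_open: assumed boolean saying whether the window is open
    return "open" if window_open else "closed"

def boot_grace_state(seconds_since_boot, boot_expiry):
    return "boot wait" if seconds_since_boot <= boot_expiry else "boot exceeded"

def idle_grace_state(worker_state, seconds_idle, idle_grace_period):
    if worker_state != "idle":
        return "not idle"
    if seconds_idle <= idle_grace_period:
        return "idle wait"
    return "idle exceeded"

def build_state(worker_state, window_open, seconds_since_boot, boot_expiry,
                seconds_idle, idle_grace_period):
    """Return (crunch_worker_state, window, boot_grace, idle_grace)."""
    return (
        worker_state,
        window_state(window_open),
        boot_grace_state(seconds_since_boot, boot_expiry),
        idle_grace_state(worker_state, seconds_idle, idle_grace_period),
    )
```

For example, `build_state("idle", True, 301, 300, 90, 60)` yields a tuple whose elements are in the same order as the keys of the policy table.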
Node manager will construct a state tuple and then consult the following table to determine what action to take. Actions are:
- None (do nothing)
- START_DRAIN (put the node into slurm draining state)
- START_SHUTDOWN (initiate cloud shutdown)
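Read row by row, the table that follows reduces to three rules, which the sketch below encodes. This is a condensed restatement for illustration, not the way Node manager stores or evaluates the policy; `choose_action` is a hypothetical name.

```python
def choose_action(state):
    """Map a (crunch_worker_state, window, boot_grace, idle_grace) tuple
    to None, "START_DRAIN", or "START_SHUTDOWN"."""
    worker, window, boot, idle = state
    if worker == "down":
        # Down nodes are shut down regardless of the other dimensions.
        return "START_SHUTDOWN"
    if worker == "unpaired" and boot == "boot exceeded":
        # Never paired with an arvados record within the boot expiry.
        return "START_SHUTDOWN"
    if worker == "idle" and window == "open" and idle == "idle exceeded":
        # Idle past the grace period while the window is open.
        return "START_DRAIN"
    # Busy nodes, and every other combination, are left alone.
    return None
```

Keeping the explicit table has the advantage that every one of the 48 combinations is visible and can be changed independently; the rule form is shorter but hides that.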
crunch_worker_state = ['unpaired', 'busy', 'idle', 'down']
window = ["open", "closed"]
boot_grace = ["boot wait", "boot exceeded"]
idle_grace = ["not idle", "idle wait", "idle exceeded"]

{
    ('busy', 'closed', 'boot exceeded', 'idle exceeded'): None,
    ('busy', 'closed', 'boot exceeded', 'idle wait'): None,
    ('busy', 'closed', 'boot exceeded', 'not idle'): None,
    ('busy', 'closed', 'boot wait', 'idle exceeded'): None,
    ('busy', 'closed', 'boot wait', 'idle wait'): None,
    ('busy', 'closed', 'boot wait', 'not idle'): None,
    ('busy', 'open', 'boot exceeded', 'idle exceeded'): None,
    ('busy', 'open', 'boot exceeded', 'idle wait'): None,
    ('busy', 'open', 'boot exceeded', 'not idle'): None,
    ('busy', 'open', 'boot wait', 'idle exceeded'): None,
    ('busy', 'open', 'boot wait', 'idle wait'): None,
    ('busy', 'open', 'boot wait', 'not idle'): None,
    ('down', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
    ('down', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
    ('down', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
    ('down', 'closed', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
    ('down', 'closed', 'boot wait', 'idle wait'): "START_SHUTDOWN",
    ('down', 'closed', 'boot wait', 'not idle'): "START_SHUTDOWN",
    ('down', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
    ('down', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
    ('down', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
    ('down', 'open', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
    ('down', 'open', 'boot wait', 'idle wait'): "START_SHUTDOWN",
    ('down', 'open', 'boot wait', 'not idle'): "START_SHUTDOWN",
    ('idle', 'closed', 'boot exceeded', 'idle exceeded'): None,
    ('idle', 'closed', 'boot exceeded', 'idle wait'): None,
    ('idle', 'closed', 'boot exceeded', 'not idle'): None,
    ('idle', 'closed', 'boot wait', 'idle exceeded'): None,
    ('idle', 'closed', 'boot wait', 'idle wait'): None,
    ('idle', 'closed', 'boot wait', 'not idle'): None,
    ('idle', 'open', 'boot exceeded', 'idle exceeded'): "START_DRAIN",
    ('idle', 'open', 'boot exceeded', 'idle wait'): None,
    ('idle', 'open', 'boot exceeded', 'not idle'): None,
    ('idle', 'open', 'boot wait', 'idle exceeded'): "START_DRAIN",
    ('idle', 'open', 'boot wait', 'idle wait'): None,
    ('idle', 'open', 'boot wait', 'not idle'): None,
    ('unpaired', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
    ('unpaired', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
    ('unpaired', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
    ('unpaired', 'closed', 'boot wait', 'idle exceeded'): None,
    ('unpaired', 'closed', 'boot wait', 'idle wait'): None,
    ('unpaired', 'closed', 'boot wait', 'not idle'): None,
    ('unpaired', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
    ('unpaired', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
    ('unpaired', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
    ('unpaired', 'open', 'boot wait', 'idle exceeded'): None,
    ('unpaired', 'open', 'boot wait', 'idle wait'): None,
    ('unpaired', 'open', 'boot wait', 'not idle'): None,
}

Note on libcloud node states:
- error, unknown -> broken
- everything else -> ok
However, we don't use the libcloud node state: it is expensive to fetch on some clouds, and it is less useful than knowing whether the node is actually live and in communication. A blanket policy that shuts down nodes that are unavailable to do useful work should also catch broken nodes.
Updated by Peter Amstutz over 8 years ago · 1 revisions