Bug #4334
closed
[Crunch] crunch-dispatch should not allocate Jobs to nodes in the idle* SLURM state
Added by Brett Smith over 10 years ago.
Updated over 10 years ago.
Description
In SLURM, "state*" means "the node was last known to be in state, but I haven't heard from it in a while." Currently, crunch-dispatch ignores the star. However, a node in the state "idle*" has usually crashed recently and is probably not usable. crunch-dispatch should not schedule work on nodes in this state.
The quick and easy implementation for this story is probably to change 'idle*' to 'down' in our database, instead of lopping off the * and making it the same as idle. No need to worry about other * states, since we only schedule onto idle nodes, so it's ok if 'down*' gets translated to 'down'. That's the safest path.
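The mapping described above can be sketched as follows. This is only an illustration in Python, not the actual crunch-dispatch code; the function names `normalize_slurm_state` and `schedulable_nodes` are hypothetical, and the input is assumed to be node/state pairs that have already been parsed out of SLURM's output.

```python
from typing import Dict, List


def normalize_slurm_state(raw_state: str) -> str:
    """Translate a SLURM node state into the state stored in the database.

    Hypothetical helper: 'idle*' (idle, but not responding) is recorded as
    'down'. For every other state the trailing '*' is simply dropped, so
    'down*' becomes 'down' and 'alloc*' becomes 'alloc'.
    """
    state = raw_state.strip().lower()
    if state == "idle*":
        return "down"
    return state.rstrip("*")


def schedulable_nodes(node_states: Dict[str, str]) -> List[str]:
    """Return the names of nodes whose normalized state is exactly 'idle'."""
    return sorted(
        name
        for name, raw_state in node_states.items()
        if normalize_slurm_state(raw_state) == "idle"
    )
```

With this mapping, a node reported as idle* is never offered for scheduling, because work only goes to nodes whose stored state is idle. For example, `schedulable_nodes({"compute0": "idle", "compute1": "idle*"})` would include only `compute0`.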
- Target version changed from Bug Triage to Arvados Future Sprints
Last night, we encountered an issue on qr1hi where SLURM was reporting nodes as idle* when they were actually down. This caused crunch-dispatch to assign work to down nodes.
Later, a job on 9tee4 failed when it tried to allocate to a busy node (9tee4-d1hrv-fsqytta8ge3pifp). We didn't catch the state at the time, but it seems likely the same basic thing happened: crunch-dispatch saw a node in idle* state when it was actually alloc.
We believe avoiding assignment to nodes in the idle* state will address both cases.
- Priority changed from Normal to High
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
- Assigned To set to Tim Pierce
- Assigned To changed from Tim Pierce to Peter Amstutz
- Description updated (diff)
- Status changed from New to In Progress
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:1bb7352bf1425dc9acf028f863eaff1e5c207571.