Bug #4334
Closed
[Crunch] crunch-dispatch should not allocate Jobs to nodes in the idle* SLURM state
Description
In SLURM, "state*" means "the node was last known to be in state, but I haven't heard from it in a while." Currently, crunch-dispatch ignores the star. However, a node in the state "idle*" has usually crashed recently and is probably not usable. crunch-dispatch should not schedule work on nodes in this state.
The quick and easy implementation for this story is probably to change 'idle*' to 'down' in our database, instead of lopping off the * and making it the same as idle. No need to worry about other * states, since we only schedule onto idle nodes, so it's ok if 'down*' gets translated to 'down'. That's the safest path.
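The mapping described above could be sketched roughly as follows. This is only an illustration in Python, not the actual crunch-dispatch code; the function name normalize_slurm_state is hypothetical, and it assumes the raw state arrives as a plain string such as "idle*".

```python
def normalize_slurm_state(raw_state: str) -> str:
    """Map a raw SLURM node state to the value recorded for scheduling.

    Hypothetical helper for illustration only: a trailing '*' means the
    node has stopped responding. 'idle*' is recorded as 'down' so that no
    work is scheduled there; every other starred state simply loses its
    star (for example 'down*' becomes 'down').
    """
    state = raw_state.strip().lower()
    if state.endswith("*"):
        base = state[:-1]
        # A non-responding idle node is treated as down, not idle.
        return "down" if base == "idle" else base
    return state


# Only nodes whose normalized state is exactly 'idle' remain schedulable.
raw_states = ["idle", "idle*", "alloc", "down*"]
schedulable = [s for s in raw_states if normalize_slurm_state(s) == "idle"]
```

With this mapping, a dispatcher that only picks nodes recorded as idle never selects an idle* node, and the other starred states need no special handling.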
Updated by Ward Vandewege over 9 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Brett Smith over 9 years ago
Last night, we encountered an issue on qr1hi where SLURM was reporting nodes as idle* when they were actually down. This caused crunch-dispatch to assign work to down nodes.
Later, a job on 9tee4 failed when it tried to allocate to a busy node (9tee4-d1hrv-fsqytta8ge3pifp). We didn't catch the state at the time, but it seems likely the same basic thing happened: crunch-dispatch saw a node in idle* state when it was actually alloc.
We believe avoiding assignment to nodes in the idle* state will address both cases.
Updated by Ward Vandewege over 9 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Ward Vandewege over 9 years ago
- Assigned To changed from Tim Pierce to Peter Amstutz
Updated by Peter Amstutz over 9 years ago
- Status changed from New to In Progress
Updated by Anonymous over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:1bb7352bf1425dc9acf028f863eaff1e5c207571.