Bug #4334
Closed
[Crunch] crunch-dispatch should not allocate Jobs to nodes in the idle* SLURM state
Description
In SLURM, "state*" means "the node was last known to be in state, but I haven't heard from it in a while." Currently, crunch-dispatch ignores the star. However, a node in the state "idle*" has usually crashed recently and is probably not usable. crunch-dispatch should not schedule work on nodes in this state.
The quick and easy implementation for this story is probably to change 'idle*' to 'down' in our database, instead of lopping off the * and making it the same as idle. No need to worry about other * states, since we only schedule onto idle nodes, so it's ok if 'down*' gets translated to 'down'. That's the safest path.
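The mapping proposed above can be sketched roughly as follows. This is only an illustration in Python, not the actual crunch-dispatch code; the function names `normalize_slurm_state` and `schedulable_nodes` and the node names in the example are hypothetical.

```python
def normalize_slurm_state(raw_state):
    """Map a raw SLURM node state to the value stored in the database.

    Hypothetical helper: 'idle*' becomes 'down' so that a node SLURM
    has not heard from recently is never treated as schedulable.
    """
    state = raw_state.strip().lower()
    if state == "idle*":
        # Last known idle but not responding: treat as unusable.
        return "down"
    # For every other state, dropping the star is harmless because
    # work is only ever scheduled onto nodes recorded as 'idle'.
    return state.rstrip("*")


def schedulable_nodes(node_states):
    """Return the sorted names of nodes whose normalized state is 'idle'."""
    return sorted(
        name
        for name, raw in node_states.items()
        if normalize_slurm_state(raw) == "idle"
    )


if __name__ == "__main__":
    # Hypothetical node names and states, for illustration only.
    example = {
        "compute0": "idle",
        "compute1": "idle*",
        "compute2": "alloc",
        "compute3": "down*",
    }
    print(schedulable_nodes(example))  # prints ['compute0']
```

With this mapping, 'down*' still becomes 'down' and 'alloc*' becomes 'alloc', so only the 'idle*' case changes behavior, which matches the "safest path" described above.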
Updated by Ward Vandewege about 10 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Brett Smith about 10 years ago
Last night, we encountered an issue on qr1hi where SLURM was reporting nodes as idle* when they were actually down. This caused crunch-dispatch to assign work to down nodes.
Later, a job on 9tee4 failed when it tried to allocate to a busy node (9tee4-d1hrv-fsqytta8ge3pifp). We didn't catch the state at the time, but it seems likely the same basic thing happened: crunch-dispatch saw a node in idle* state when it was actually alloc.
We believe avoiding assignment to nodes in the idle* state will address both cases.
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Ward Vandewege about 10 years ago
- Assigned To changed from Tim Pierce to Peter Amstutz
Updated by Peter Amstutz about 10 years ago
- Status changed from New to In Progress
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:1bb7352bf1425dc9acf028f863eaff1e5c207571.