Bug #4334

closed

[Crunch] crunch-dispatch should not allocate Jobs to nodes in the idle* SLURM state

Added by Brett Smith about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 1.0

Description

In SLURM, "state*" means "the node was last known to be in state, but I haven't heard from it in a while." Currently, crunch-dispatch ignores the star. However, a node in the state "idle*" has usually crashed recently and is probably not usable. crunch-dispatch should not schedule work on nodes in this state.

The quick and easy implementation for this story is probably to change 'idle*' to 'down' in our database, instead of lopping off the * and making it the same as idle. No need to worry about other * states, since we only schedule onto idle nodes, so it's ok if 'down*' gets translated to 'down'. That's the safest path.
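
The mapping described above could look roughly like the sketch below. It is an illustration only, not taken from crunch-dispatch: the function names `normalize_slurm_state` and `schedulable_nodes` and the node names in the usage note are hypothetical, and it assumes the raw state strings have already been obtained from SLURM.

```python
def normalize_slurm_state(raw_state: str) -> str:
    """Map a raw SLURM node state to the state recorded for scheduling.

    Hypothetical helper: 'idle*' becomes 'down'; any other starred
    state simply loses its star.
    """
    state = raw_state.strip().lower()
    if state == "idle*":
        # The node was last seen idle but is not responding, so treat
        # it as down and keep work away from it.
        return "down"
    # Dropping the star elsewhere is harmless, because only nodes
    # recorded as 'idle' ever receive work.
    return state.rstrip("*")


def schedulable_nodes(node_states: dict[str, str]) -> list[str]:
    """Return the names of nodes whose normalized state is 'idle'."""
    return sorted(
        name
        for name, raw in node_states.items()
        if normalize_slurm_state(raw) == "idle"
    )
```

With a hypothetical input of `{"compute0": "idle", "compute1": "idle*", "compute2": "down*", "compute3": "alloc"}`, only `compute0` would be considered schedulable: `compute1` is recorded as down, `compute2` has its star stripped and stays down, and `compute3` is allocated.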


Subtasks 2 (0 open, 2 closed)

Task #4450: Review 4334-idle-star-is-down (Resolved, 10/28/2014)
Task #4376: Diagnose and fix (Resolved, Peter Amstutz, 10/28/2014)

Related issues 2 (0 open, 2 closed)

Related to Arvados - Bug #4314: [Crunch] Figure out why this job was marked Failed unexpectedly (Resolved, Peter Amstutz, 10/24/2014)
Related to Arvados - Bug #4368: [Crunch] Improve node failure detection and job retry logic (Closed, 10/31/2014)
#1

Updated by Ward Vandewege about 10 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints
#2

Updated by Brett Smith about 10 years ago

Last night, we encountered an issue on qr1hi where SLURM was reporting nodes as idle* when they were actually down. This caused crunch-dispatch to assign work to down nodes.

Later, a job on 9tee4 failed when it tried to allocate to a busy node (9tee4-d1hrv-fsqytta8ge3pifp). We didn't catch the state at the time, but it seems likely the same basic thing happened: crunch-dispatch saw a node in idle* state when it was actually alloc.

We believe avoiding assignment to nodes in the idle* state will address both cases.

#3

Updated by Brett Smith about 10 years ago

  • Priority changed from Normal to High
#4

Updated by Ward Vandewege about 10 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint
#5

Updated by Tim Pierce about 10 years ago

  • Assigned To set to Tim Pierce
#6

Updated by Ward Vandewege about 10 years ago

  • Assigned To changed from Tim Pierce to Peter Amstutz
#7

Updated by Ward Vandewege about 10 years ago

  • Description updated (diff)
#8

Updated by Peter Amstutz about 10 years ago

  • Status changed from New to In Progress
#9

Updated by Anonymous about 10 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:1bb7352bf1425dc9acf028f863eaff1e5c207571.
