Idea #4127

[API] Nodes have a method to request and record shutdowns

Added by Brett Smith over 9 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: 3.0

Description

The current Node Manager decides to shut down cloud nodes based on a node record's SLURM state. It's possible that a Node could be shut down shortly after it is allocated work. This isn't a huge loss of compute time, but it does cause a Job failure that can look mysterious at first.

It would be better if the API server provided an atomic way to request and record Node shutdowns. This has a few components:

  • Add a method to NodesController that marks a node as "being shut down" if and only if it is not currently running a Job.
  • Modify the Node model so that attempts to assign a job to it (setting job_uuid) fail if it's marked as "being shut down."
  • Modify crunch-dispatch so that it updates node assignments on the API server, and checks for OK responses, before it begins dispatching work.
  • Modify the Node Manager to request shutdowns with the API server, and only proceed after an OK response.
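
The first two components above can be sketched roughly as follows. This is only an illustration in Python, not the API server's code: the `Node` class, the `shutting_down` flag, `request_shutdown`, `assign_job`, and `NodeShuttingDownError` are all hypothetical names, and a `threading.Lock` stands in for whatever database transaction or row lock the real server would use to make the check-and-mark step atomic.

```python
import threading
from typing import Optional


class NodeShuttingDownError(Exception):
    """Raised when a job is assigned to a node marked as being shut down."""


class Node:
    """In-memory stand-in for a node record (hypothetical, not the Arvados model)."""

    def __init__(self, uuid: str) -> None:
        self.uuid = uuid
        self.job_uuid: Optional[str] = None
        self.shutting_down = False
        # Assumption: the lock plays the role of a database transaction.
        self._lock = threading.Lock()

    def request_shutdown(self) -> bool:
        """Mark the node as being shut down if and only if it has no job."""
        with self._lock:
            if self.job_uuid is not None:
                return False  # busy: the caller must not shut this node down
            self.shutting_down = True
            return True

    def assign_job(self, job_uuid: str) -> None:
        """Set job_uuid, failing if the node is marked as being shut down."""
        with self._lock:
            if self.shutting_down:
                raise NodeShuttingDownError(self.uuid)
            self.job_uuid = job_uuid


idle = Node("node-idle")
busy = Node("node-busy")
busy.assign_job("job-1")

# The shutdown request succeeds only for the node with no job.
shutdown_ok_idle = idle.request_shutdown()
shutdown_ok_busy = busy.request_shutdown()

# A later assignment to the node being shut down is rejected.
try:
    idle.assign_job("job-2")
    late_assignment_rejected = False
except NodeShuttingDownError:
    late_assignment_rejected = True
```

Under this sketch, the Node Manager would proceed with a shutdown only when `request_shutdown` returns true, and crunch-dispatch would treat a rejected assignment as a signal to pick another node before dispatching work.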

Related issues

  • Related to Arvados - Bug #4368: [Crunch] Improve node failure detection and job retry logic (Closed, 10/31/2014)
  • Follows Arvados - Feature #2881: [OPS] Basic node manager that can start/stop compute nodes based on demand (Resolved, Brett Smith, 07/16/2014)

#1

Updated by Brett Smith about 9 years ago

It's not clear that we need this now that #4380 is done.

#2

Updated by Peter Amstutz about 4 years ago

  • Status changed from New to Closed