Story #4127

[API] Nodes have a method to request and record shutdowns

Added by Brett Smith almost 6 years ago. Updated 7 months ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
07/17/2014
Due date:
07/17/2014
% Done:

0%

Estimated time:
Story points:
3.0

Description

The current Node Manager decides to shut down cloud nodes based on a node record's SLURM state. Because that check and the shutdown itself are not atomic, a Node could be shut down shortly after it is allocated work. This isn't a huge loss of compute time, but it does cause a Job failure that can look mysterious at first.

It would be better if the API server provided an atomic way to request and record Node shutdowns. This has a few components:

  • Add a method to NodesController that marks a node as "being shut down" if and only if it is not currently running a Job.
  • Modify the Node model so that any attempt to assign a job to it (setting job_uuid) fails if the node is marked as "being shut down."
  • Modify crunch-dispatch so that it updates node assignments on the API server, and checks for OK responses, before it begins dispatching work.
  • Modify the Node Manager to request shutdowns with the API server, and only proceed after an OK response.
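
The first two components amount to a pair of mutually exclusive state changes guarded by a single check. The following is a minimal sketch of that rule in Python, not the API server's actual code: `NodeRecord`, `NodeStateError`, `request_shutdown`, and `assign_job` are hypothetical names, and an in-process lock stands in for whatever transaction or row lock the real server would use.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


class NodeStateError(Exception):
    """Raised when a requested change conflicts with the node's current state."""


@dataclass
class NodeRecord:
    """Hypothetical in-memory stand-in for a node record on the API server."""
    uuid: str
    job_uuid: Optional[str] = None
    shutting_down: bool = False
    # The lock makes each check-then-set below a single atomic step.
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False)

    def request_shutdown(self) -> None:
        """Mark the node as being shut down only if it has no job assigned."""
        with self._lock:
            if self.job_uuid is not None:
                raise NodeStateError(
                    f"{self.uuid} is running {self.job_uuid}; shutdown refused")
            self.shutting_down = True

    def assign_job(self, job_uuid: str) -> None:
        """Set job_uuid, failing if the node is marked as being shut down."""
        with self._lock:
            if self.shutting_down:
                raise NodeStateError(
                    f"{self.uuid} is being shut down; assignment refused")
            self.job_uuid = job_uuid


if __name__ == "__main__":
    node = NodeRecord(uuid="node-1")
    node.request_shutdown()          # succeeds: the node is idle
    try:
        node.assign_job("job-1")     # refused: the node is being shut down
    except NodeStateError as error:
        print(error)
```

Under this sketch, crunch-dispatch would treat a failed `assign_job` as "do not dispatch here," and the Node Manager would treat a failed `request_shutdown` as "leave the node running," which mirrors the "only proceed after an OK response" rule in the last two components.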

Related issues

Related to Arvados - Bug #4368: [Crunch] Improve node failure detection and job retry logic (Closed, 10/31/2014)

Follows Arvados - Feature #2881: [OPS] Basic node manager that can start/stop compute nodes based on demand (Resolved, 07/16/2014)

History

#1 Updated by Brett Smith over 5 years ago

It's not clear that we need this now that #4380 is done.

#2 Updated by Peter Amstutz 7 months ago

  • Status changed from New to Closed
