Feature #2881

[OPS] Basic node manager that can start/stop compute nodes based on demand

Added by Ward Vandewege over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
-
Target version:
Start date:
07/16/2014
Due date:
% Done:

100%

Estimated time:
(Total: 12.00 h)
Story points:
5.0

Subtasks

Task #3245: deploy node managerClosedWard Vandewege

Task #3239: fix update_node_attributes script and/or merge its remaining functionality into crunch-dispatchResolvedWard Vandewege

Task #3944: Review 2881-node-has-job-wipResolvedTom Clegg

Task #3248: implement basic functionalityResolvedWard Vandewege

Task #4106: Review 2881-node-manager-wipResolvedBrett Smith


Related issues

Related to Arvados - Bug #4368: [Crunch] Improve node failure detection and job retry logicClosed10/31/2014

Precedes Arvados - Story #4127: [API] Nodes have a method to request and record shutdownsNew07/17/201407/17/2014

Associated revisions

Revision da0d61f9
Added by Brett Smith about 5 years ago

Merge branch '2881-node-has-job-wip'

Refs #2881. Closes #3944.

Revision 44e01cf2
Added by Brett Smith about 5 years ago

Merge branch '2881-node-manager'

Closes #2881, #4106.

Revision 5060bf0d (diff)
Added by Brett Smith about 5 years ago

2881: run-tests tests the Node Manager.

Refs #2881.

Revision 5060bf0d (diff)
Added by Brett Smith about 5 years ago

2881: run-tests tests the Node Manager.

Refs #2881.

Revision f2695a30 (diff)
Added by Brett Smith about 5 years ago

4139: Add *.egg to Node Manager's .gitignore.

`python setup.py test` will automatically download dependencies to the
source directory if you don't already have them available in your
environment. Refs #2881, #4139.

Revision 284acd7c (diff)
Added by Brett Smith about 5 years ago

2881: Skip Node Manager tests for now.

See comment for detailed rationale. Refs #2881, #4139.

Revision 284acd7c (diff)
Added by Brett Smith about 5 years ago

2881: Skip Node Manager tests for now.

See comment for detailed rationale. Refs #2881, #4139.

Revision 6bdfed00
Added by Brett Smith about 5 years ago

Merge branch '4139-blocking-node-manager-tests-wip'

Refs #2881, #4139. Closes #4184.

History

#1 Updated by Ward Vandewege over 5 years ago

  • Assigned To set to Ward Vandewege

#2 Updated by Tom Clegg over 5 years ago

  • Status changed from New to In Progress

#3 Updated by Tom Clegg over 5 years ago

  • Target version changed from 2014-06-17 Curating and Crunch to 2014-07-16 Sprint

#4 Updated by Tom Clegg over 5 years ago

  • Story points changed from 3.0 to 2.0

#5 Updated by Ward Vandewege over 5 years ago

  • Target version changed from 2014-07-16 Sprint to 2014-08-06 Sprint

#6 Updated by Ward Vandewege over 5 years ago

  • Subject changed from Basic node manager that can start/stop compute nodes based on demand to [OPS] Basic node manager that can start/stop compute nodes based on demand

#7 Updated by Tom Clegg over 5 years ago

  • Target version changed from 2014-08-06 Sprint to Arvados Future Sprints

#8 Updated by Ward Vandewege over 5 years ago

  • Target version changed from Arvados Future Sprints to 2014-08-27 Sprint

#9 Updated by Tim Pierce over 5 years ago

  • Category set to Deployment

#10 Updated by Brett Smith over 5 years ago

  • Status changed from In Progress to New
  • Assigned To changed from Ward Vandewege to Brett Smith

#11 Updated by Tom Clegg over 5 years ago

  • Target version changed from 2014-08-27 Sprint to 2014-09-17 sprint

#12 Updated by Brett Smith over 5 years ago

  • Story points changed from 2.0 to 3.0

#13 Updated by Ward Vandewege about 5 years ago

  • Project changed from Arvados to OPS
  • Category deleted (Deployment)
  • % Done changed from 20 to 0

#14 Updated by Ward Vandewege about 5 years ago

  • Project changed from OPS to Arvados

#15 Updated by Brett Smith about 5 years ago

  • Target version changed from 2014-09-17 sprint to 2014-10-08 sprint

#16 Updated by Brett Smith about 5 years ago

Notes from discussing with Ward:

  • This just needs to be better than Ward doing it by hand.
  • Nodes have an info field :running_job_uuid to say what Job they're working on. crunch-dispatch needs to be modified to update this; see #3239.
  • We want to bring up as many Nodes as Jobs request, up to a configured maximum.
  • We should only shut down Nodes in the 5-minute window before their hourly allocation is up. The current prototype has code to fuzz things to accurately describe when the Node was created (i.e., we started paying for it). We do want to be aggressive about shutdowns. If there's no demand for a Node when the hourly window arrives, shut it down.

#17 Updated by Brett Smith about 5 years ago

  • Status changed from New to In Progress

#18 Updated by Brett Smith about 5 years ago

2881-node-has-job-wip is up for review. It implements the work in task #3239, with amendments discussed on IRC. Nodes' job assignments are stored on the API server with a Rails association. crunch-dispatch updates this information when it makes assignments.

#19 Updated by Tom Clegg about 5 years ago

Reviewing 2881-node-has-job-wip at e56010c

Mostly this looks good. It's nice to see some old cruft go away. Thank you for remembering to update the docs!

I think we should use job_uuid (or running_job_uuid) instead of job_id. Rationale:
  • We'll want to use this flag in client code, and it would be better to use the same patterns we use elsewhere for related objects. (AFAICT, crunch-dispatch.rb is capable of updating the job_id attribute only by virtue of running inside the API server: there's no API for learning the numeric ID of a job.)
  • The way pipeline instance includes :pipeline_template in its api response has turned out to be an annoying exception - for example, arv edit chokes on it because it isn't acceptable to parrot it in an "update" operation. Perhaps we can avoid sending (more) embedded objects until we have a join API?
  • Similiarly, my decision to have API tokens reference users by numeric ID instead of UUID (and not to give tokens UUIDs themselves) has turned out to be more of an annoying exception than a nice optimization, IMO.
  • You can tell belongs_to to use the job's uuid and running_job_uuid as the primary and foreign key respectively, e.g., pipeline_instance.rb.
  • {"running" => true} as "it's a secret" won't be an option if we expose a regular _uuid field, though. Perhaps just leave out that field, and let crunch_worker_status="busy" vs "idle" communicate the distinction between "no job" and "invisible job", in cases where the job is not readable by current user?

Likewise I think the Job API response should just offer node_uuids rather than bundling the node records inside the response (until we have a join API).

I feel like there might be a better way for Node to decide which job_uuids are readable by the client, but I haven't found it yet, so I'm willing to live with this if it works. It seems safe, if not exactly elegant. Another fallback would be to let the client do something like Job.where(state: 'Running').collect(&:node_uuids) to learn the visible portion of the job↔node map without admin privileges.

Doing assign_job_to_nodes() immediately after starting a salloc [...] crunch-job process may be overly optimistic. I think it's very common (especially when multiple dispatch processes exist, but even with just one) for salloc to report "requested node configuration is not available" and exit. I worry that in such cases this code will incorrectly update the nodes' job_id fields and (in the multiple-dispatch case) likely overwrite some recently-written correct data.

#20 Updated by Brett Smith about 5 years ago

Branch at 10caf4b incorporates all the review comments. I did excise some SLURM node-parsing code from crunch-dispatch, but I'm pretty sure this is safe because (a) it looks like we've been using the %n branch of sinfo this entire time, and (b) documentation on the SLURM site suggests that %n is still supported in current versions. This should generally do a better job of keeping Arvados node information in sync with SLURM.

#21 Updated by Tom Clegg about 5 years ago

LGTM, thank you.

Except: IRC verdict is it seems slightly better to leave the "don't use %n if slurm <2.3" code in for now. For posterity:
  • $ sinfo --version; sinfo --format="%n:%t" 
    slurm 2.1.11
    Invalid node format specification: n
    STATE
    down*
    ...
    
  • $ sinfo --version; sinfo --format="%n:%t" |head
    slurm 2.3.4
    HOSTNAMES:STATE
    compute1:down*
    ...
    

#22 Updated by Brett Smith about 5 years ago

More notes from Ward:

  • It seems worthwhile to have some sort of proper event loop for this, since the program's main purpose is to coordinate actions between different services.
  • The event loop should accept SIGTERM as a signal to exit, but make sure things are actually in a consistent state before actually exiting.
  • Using Fog, Ward could only figure out how to tag nodes after they came up.
  • The node manager should be idempotent. It shouldn't keep any of its own state; only get it from other services.
  • If Arvados thinks a node is down, but the cloud service thinks it's up, shut it down at the normal window—it's a sunk cost until then.
  • If the cloud service thinks a node is down, but Arvados thinks it's up, leave it alone. (Maybe send e-mail if the situation persists long enough?) It's not worth the risk of stopping compute work.
  • Known configuration knobs: authentication information, cloud service, cloud region, tag information, instance sizes to use, image to use, minimum number of nodes to have up, maximum number of nodes to allow up.

#23 Updated by Tom Clegg about 5 years ago

  • Story points changed from 3.0 to 5.0

#24 Updated by Brett Smith about 5 years ago

Branch 2881-node-manager-wip is up for review. You may want to review the Hacking Node Manager wiki page to get an overview of how it works and is organized before you dive in.

#25 Updated by Peter Amstutz about 5 years ago

General comment: (as discussed in person) without a more comprehensive design document or at least a few comments describing the various classes and methods, it's hard to evaluate whether a particular function or class is doing what it is supposed to do.

  1. Use of the bare term "size" for node size is a little bit confusing since it is a pretty generic term, ideally would like to see a search and replace "size" -> "node_size" or somesuch
  2. Similarly it would be clearer to rename arvados_node -> arvados_node_record
  3. assigned_arv -> assigned_at_arv (since it's tracking timestamps)
  4. ComputeNodeActor -> ComputeNodeRunningActor for consistency with ComputeNodeSetupActor, ComputeNodeSyncActor, ComputeNodeShutdownActor?
  5. It looks like the "size" should be listed from smallest/cheapest to biggest/most expensive in the config file for the node type selection logic to work to DTRT, but that's not documented.
  6. "In general, the daemon is response for managing the lifetime of all the ComputeNode*Actors" -> docstring or comment
  7. It's hard to keep straight all of paired_cloud, unpaired_cloud, paired_arv, unpaired_arv, and PairingTracker seems complex. Consider a simpler representation, such as a set of tuples where each tuple is either (cloud, arv) or (cloud, None) or (None, arv). In addition, the pairing data structure(s) should be managed in their own helper object instead of being all over NodeManagerDaemonActor (which has enough to do already).
  8. NodeManagerDaemonActor has both a shutdown_offer method (for shutting down nodes) and a shutdown method (for shutting down node manager), this naming is slightly confusing.

#26 Updated by Brett Smith about 5 years ago

Peter Amstutz wrote:

  1. It's hard to keep straight all of paired_cloud, unpaired_cloud, paired_arv, unpaired_arv, and PairingTracker seems complex. Consider a simpler representation, such as a set of tuples where each tuple is either (cloud, arv) or (cloud, None) or (None, arv). In addition, the pairing data structure(s) should be managed in their own helper object instead of being all over NodeManagerDaemonActor (which has enough to do already).

Added dictionary wrapper classes with dedicated methods to track this. Let me know what you think of this approach.

Thanks for this, it is much easier to follow what is going on with the node list updates and pairing now.

Everything else looks good to me.

#27 Updated by Brett Smith about 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:44e01cf266a3c062b2f0f5bb3426672024367d38.

Also available in: Atom PDF