Feature #6520

[Node Manager] [Crunch2] Take queued containers into account when computing how many nodes should be up

Added by Tom Clegg over 8 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Category: Node Manager
Story points: 0.5
Release relationship: Auto

Description

Add one node to the wishlist for each queued container, just like we currently add one (or more) nodes to the wishlist for queued jobs. While Crunch v2 will support running multiple containers per node, that's less critical in the cloud: as long as we can boot approximately the right size node, there's not too much overhead in just having one node per container. And it's something we can do relatively quickly with the current Node Manager code.
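The one-node-per-container rule above can be sketched as follows (the function name, dict shape, and `pick_size` callback are illustrative assumptions, not the actual Node Manager API):

```python
def wishlist_from_containers(queued_containers, pick_size):
    """Build a node wishlist with exactly one entry per queued container.

    `pick_size` maps a container's runtime constraints to a node size.
    Packing multiple containers onto one node is deliberately not
    attempted here, matching the ticket's scope.
    """
    return [pick_size(c["runtime_constraints"]) for c in queued_containers]
```

A caller would pass the same size-selection logic used for the job-queue wishlist, so the Daemon actor can treat both wishlists uniformly.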

This won't be perfect from a scheduling perspective, especially in the interaction between Crunch v1 and Crunch v2. We expect that Crunch v2 jobs will generally "take priority" over Crunch v1 jobs, because SLURM will dispatch them from its own queue before crunch-dispatch has a chance to look and allocate nodes. We're OK with that limitation for the time being.

Node Manager should get the list of queued containers from SLURM itself, because that's the most direct source of truth about what is waiting to run. Node Manager can get information about the runtime constraints of each container either from SLURM, or from the Containers API.
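A minimal sketch of reading the pending queue from SLURM. The `squeue` format string and the dict fields are assumptions for illustration; in practice the runtime constraints may instead come from the Containers API, as noted above.

```python
import subprocess

# Assumed invocation: job name, CPU count, and minimum memory for
# pending jobs, one per line, pipe-separated.
SQUEUE_CMD = ["squeue", "--state=PENDING", "--noheader", "--format=%j|%C|%m"]

def slurm_mem_to_mb(mem):
    # SLURM memory values may carry a K/M/G/T suffix; bare numbers are MB.
    units = {"K": 1 / 1024.0, "M": 1, "G": 1024, "T": 1024 * 1024}
    if mem and mem[-1] in units:
        return int(float(mem[:-1]) * units[mem[-1]])
    return int(mem)

def parse_pending(squeue_output):
    """Turn 'name|cpus|min_memory' lines into constraint dicts."""
    pending = []
    for line in squeue_output.splitlines():
        if not line.strip():
            continue
        name, cpus, mem = line.split("|")
        pending.append({"name": name,
                        "min_cores": int(cpus),
                        "min_ram_mb": slurm_mem_to_mb(mem)})
    return pending

def queued_containers():
    # Shells out to squeue; this only works on a host with SLURM installed.
    return parse_pending(subprocess.check_output(SQUEUE_CMD).decode())
```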

Acceptance criteria:

  • Node Manager can generate a wishlist that is informed by containers in the SLURM queue. (Whether that's the existing wishlist or a new one is an implementation detail, not an acceptance criterion either way.)
  • The node sizes in that wishlist are the smallest able to meet the runtime constraints of the respective containers.
  • The Daemon actor considers these wishlist items when deciding whether to boot or shut down nodes, just as it does today with the wishlist generated from the job queue.
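The smallest-size criterion above can be sketched like this (the dict shapes are assumptions for this sketch; the real Node Manager wraps libcloud NodeSize objects):

```python
def smallest_usable_size(node_sizes, constraints):
    """Pick the cheapest node size that satisfies a container's
    runtime constraints, or return None if nothing fits."""
    usable = [s for s in node_sizes
              if s["cores"] >= constraints.get("min_cores", 1)
              and s["ram_mb"] >= constraints.get("min_ram_mb", 0)]
    return min(usable, key=lambda s: s["price"]) if usable else None
```

Using price as the tiebreaker keeps "smallest" well-defined even when one size has more cores and another has more RAM.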

Implementation notes:

  • Node Manager will use sinfo to determine node status (alloc/idle/drained/down) instead of using the information from the node table. A Crunch v2 installation won't store node state in the nodes table; other tools, such as Workbench, will be modified accordingly.
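A hedged sketch of reading node states from sinfo; the format string and the exact set of suffix characters SLURM appends to states are assumptions here:

```python
def parse_sinfo_states(sinfo_output):
    """Map node name -> SLURM state, given output like that of
    'sinfo --Node --noheader --format=%N|%t' (flags assumed).

    SLURM may suffix a state with characters such as '*' (unresponsive)
    or '~' (powered down); strip them so 'idle*' compares as 'idle'."""
    states = {}
    for line in sinfo_output.splitlines():
        if not line.strip():
            continue
        node, state = line.split("|")
        states[node] = state.rstrip("*~#$@+")
    return states

def is_available(state):
    # Idle, allocated, or mixed nodes count as healthy;
    # drained/draining/down nodes do not.
    return state in ("idle", "alloc", "mix")
```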

Subtasks (3: 0 open, 3 closed)

Task #11031: Review 6520-nodemanager-crunchv2 (Resolved, Peter Amstutz, 07/08/2015)
Task #11106: Review 6520-skip-compute0 (Resolved, Peter Amstutz, 07/08/2015)
Task #11061: crunch-dispatch-slurm running on cloud clusters (Resolved, Nico César, 07/08/2015)

Related issues

Related to Arvados - Idea #6282: [Crunch] Write stories for implementation of Crunch v2 (Resolved, Peter Amstutz, 06/23/2015)
Blocked by Arvados - Idea #6429: [API] [Crunch2] Implement "containers" and "container requests" tables, models and controllers (Resolved, Peter Amstutz, 12/03/2015)