Bug #4357 (closed): [Node Manager] Should spin up more nodes (up to max_nodes) when up nodes are busy

Added by Brett Smith about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: Node Manager
Target version: 2014-11-19 sprint
Story points: 1.0

Description

Consider the following scenario:

  • 2 nodes are currently up.
  • A job that needs two nodes is created, assigned to the 2 available nodes, and started.
  • A second job that needs two nodes is created.

Node Manager does not create more nodes in this case: it sees two nodes up and a queue requesting two servers, so it considers the request satisfied. Nodes that are currently running jobs should not be counted as available, so that Node Manager creates additional nodes in this scenario.
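
As an illustration of the intended accounting (a minimal sketch with hypothetical names, not the actual Node Manager code): only idle nodes should offset the queue's request, and booting stays capped at max_nodes.

    # Minimal sketch of the desired accounting; names are illustrative,
    # not the actual Node Manager implementation.
    def nodes_wanted(queued_servers, up_nodes, busy_nodes, max_nodes):
        # Only nodes that are up *and not running a job* can satisfy the queue.
        idle_nodes = up_nodes - busy_nodes
        shortfall = queued_servers - idle_nodes
        # Never boot past the configured ceiling.
        allowed = max_nodes - up_nodes
        return max(0, min(shortfall, allowed))

    # Scenario from the description: 2 nodes up, both busy, a second two-node
    # job queued.  With max_nodes=10 this yields 2, so two more nodes boot.
    print(nodes_wanted(queued_servers=2, up_nodes=2, busy_nodes=2, max_nodes=10))  # -> 2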


Subtasks 1 (0 open, 1 closed)

Task #4370: Review 4357-node-manager-busy-nodes-wip (Resolved, Brett Smith, 10/31/2014)
#1

Updated by Ward Vandewege about 10 years ago

  • Target version changed from Bug Triage to 2014-11-19 sprint
#2

Updated by Brett Smith about 10 years ago

  • Assigned To set to Brett Smith
#3

Updated by Brett Smith about 10 years ago

  • Status changed from New to In Progress
#4

Updated by Tom Clegg about 10 years ago

Looking at b0e6de3, just a few suspicious characters:

ComputeNodeMonitorActor's state() method looks like it will return ALLOC if arvados_node['info']['slurm_state'] is 'down'. This means a node slurm calls "down" is not eligible for shutdown, which doesn't sound right. The old code seems to make the same decision, though, so this seems to be an unrelated bug if it's a bug at all. Should there be a DOWN state -- or BROKEN, to disambiguate from "turned off"? More generally, using ALLOC instead of UNKNOWN as the fall-through case feels a bit weird: if slurm_state is "blergh", for example, we're surely not going to schedule jobs on it, so perhaps it's better to call it UNKNOWN and shut it down?
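
For reference, roughly the shape of the fall-through being questioned here (an illustrative sketch only, not the actual ComputeNodeMonitorActor code):

    # Illustrative sketch of the classification under discussion; the real
    # method lives on ComputeNodeMonitorActor and may differ in detail.
    IDLE, ALLOC, UNKNOWN = 'idle', 'alloc', 'unknown'

    def classify(arvados_node):
        try:
            slurm_state = arvados_node['info']['slurm_state']
        except (KeyError, TypeError):
            return UNKNOWN
        if slurm_state == 'idle':
            return IDLE
        # Fall-through: 'alloc', 'down', 'blergh', ... all land here, so a
        # node SLURM calls "down" becomes ineligible for shutdown even though
        # no job will ever be scheduled on it.
        return ALLOC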

_shutdown_eligible()'s state==UNKNOWN case looks like it's meant to catch nodes which were created but never got paired. However, it says and timestamp_fresh -- should that be and not timestamp_fresh? (Either way I think this case could benefit from a wee comment.) Again, this looks like it preserves existing behavior.

NodeManagerDaemonActor's _nodes_excess() method looks like it doesn't (but should) take allocated nodes into account. Perhaps test_shutdown_* in test_daemon.py should also test some cases like "1 alloc, 1 idle, 1 job in queue → declined" to prove this one way or the other?
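
The suggested case, sketched as plain arithmetic rather than against the real test_daemon.py harness (illustrative names only):

    # Illustrative arithmetic for the suggested test case; not the real
    # _nodes_excess() implementation.
    def nodes_excess(up_nodes, alloc_nodes, queued_servers):
        # Excess is judged against idle capacity only: nodes allocated to
        # jobs are excluded before comparing against the queue.
        idle_nodes = up_nodes - alloc_nodes
        return max(0, idle_nodes - queued_servers)

    # "1 alloc, 1 idle, 1 job in queue" -> 0 excess: the shutdown is declined.
    print(nodes_excess(up_nodes=2, alloc_nodes=1, queued_servers=1))  # -> 0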

#5

Updated by Brett Smith about 10 years ago

Tom Clegg wrote:

> Looking at b0e6de3, just a few suspicious characters:
>
> ComputeNodeMonitorActor's state() method looks like it will return ALLOC if arvados_node['info']['slurm_state'] is 'down'. This means a node slurm calls "down" is not eligible for shutdown, which doesn't sound right. The old code seems to make the same decision, though, so this seems to be an unrelated bug if it's a bug at all. Should there be a DOWN state -- or BROKEN, to disambiguate from "turned off"? More generally, using ALLOC instead of UNKNOWN as the fall-through case feels a bit weird: if slurm_state is "blergh", for example, we're surely not going to schedule jobs on it, so perhaps it's better to call it UNKNOWN and shut it down?
>
> _shutdown_eligible()'s state==UNKNOWN case looks like it's meant to catch nodes which were created but never got paired. However, it says and timestamp_fresh -- should that be and not timestamp_fresh? (Either way I think this case could benefit from a wee comment.) Again, this looks like it preserves existing behavior.

The code is correct. The rationale here is that if it's a new unpaired node, then it's most likely that bootstrapping failed somehow, and we should just go ahead and shut it down. If it's an old unpaired node, then there are any number of possible explanations (network trouble, someone modified/deleted the underlying node record, etc.), but it might be doing useful compute work right now, so leave it alone. See the comment about "node stale time" in doc/ec2.example.cfg. I've also added a comment to the code.
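
The unpaired-node rule described above, restated as a rough sketch (hypothetical names, not the actual _shutdown_eligible() code):

    # Illustrative restatement of the rule above; names are hypothetical,
    # not the real _shutdown_eligible() code.
    def shutdown_eligible(state, timestamp_fresh):
        if state == 'unknown':
            # UNKNOWN means the node was created but never paired with an
            # Arvados node record.  A *fresh* timestamp means the node is new,
            # so bootstrapping most likely failed and shutting it down is safe
            # -- hence "and timestamp_fresh", not "and not timestamp_fresh".
            # An old unpaired node might still be doing useful work, so it is
            # left alone.
            return timestamp_fresh
        return state == 'idle'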

This is the same basic motivation behind the state() implementation. We only want to consider a node idle (and therefore eligible for shutdown) if it is very actively affirming that it's in the idle state. If SLURM thinks the node is down, but the cloud thinks it's up, then something's weird and a sysadmin should investigate and manually resolve it. The math still works out the way we want it: if there's one node in the SLURM down state, and one job in the queue, this version of Node Manager will start a new node to replace it, up to max_nodes. If you think we should proactively shut down nodes that are SLURM down, please file a ticket for that.

I understand that the constant naming makes things confusing, though. I probably shouldn't borrow the SLURM state names since that's not what we actually care about. What about NONIDLE? Got a better suggestion?

> NodeManagerDaemonActor's _nodes_excess() method looks like it doesn't (but should) take allocated nodes into account. Perhaps test_shutdown_* in test_daemon.py should also test some cases like "1 alloc, 1 idle, 1 job in queue → declined" to prove this one way or the other?

You're right. I thought about this case but obviously thought about it wrong. I should've done less thinking, more test writing. :) Fixed in be60ff5.

#6

Updated by Tom Clegg about 10 years ago

Brett Smith wrote:

> The code is correct. The rationale here is that if it's a new unpaired node, then it's most likely that bootstrapping failed somehow, and we should just go ahead and shut it down. If it's an old unpaired node, then there are any number of possible explanations (network trouble, someone modified/deleted the underlying node record, etc.), but it might be doing useful compute work right now, so leave it alone. See the comment about "node stale time" in doc/ec2.example.cfg. I've also added a comment to the code.

Ah, I see. I was expecting "new" to mean "booting": we asked the cloud for a new node, but it hasn't been paired yet. But it really means something more like "not old enough to have become unpaired as a result of unresponsiveness".

I was worried that this created a subtle dependency on "time for a requested node to boot up and become ready/idle/paired" being short enough that a new node can't arrive in the "economically sensible shutdown window" before it even gets paired. Thinking about it more, that sounds correct: we won't even be asking whether anyone is _shutdown_eligible() if we actually need that node for something. And if it reaches the shutdown window and we don't need it, who cares whether it finished booting?

> I understand that the constant naming makes things confusing, though. I probably shouldn't borrow the SLURM state names since that's not what we actually care about. What about NONIDLE? Got a better suggestion?

I suppose the labels should reflect the categories that determine node manager's behavior. IDLE here suggests it's wasted/available (depending on what's in the job queue!) which seems right. If UNKNOWN were split into BOOTING and UNRESPONSIVE, would that cover the cases? Or even BOOTING and UNKNOWN?

> You're right. I thought about this case but obviously thought about it wrong. I should've done less thinking, more test writing. :) Fixed in be60ff5.

Looks good.

One other minor thing: The names of these methods suggest one is len(the other) but that's not quite true. Would it be clearer to call the second method monitor_count_alive()?
    +    def monitor_list(self):
    +        return pykka.ActorRegistry.get_by_class(nmcnode.ComputeNodeMonitorActor)
    +
    +    def monitor_count(self):
    +        return sum(1 for actor in self.monitor_list() if actor.is_alive())
#7

Updated by Brett Smith about 10 years ago

Tom Clegg wrote:

> I was worried that this created a subtle dependency on "time for a requested node to boot up and become ready/idle/paired" being short enough that a new node can't arrive in the "economically sensible shutdown window" before it even gets paired. Thinking about it more, that sounds correct: we won't even be asking whether anyone is _shutdown_eligible() if we actually need that node for something. And if it reaches the shutdown window and we don't need it, who cares whether it finished booting?

You're describing the current behavior pretty accurately. For what it's worth, #4293 is expected to change this somewhat, so that a node that doesn't pair after a certain time will be proactively shut down no matter what, on the theory that it's not going to be useful. But that will require a new configuration variable, so it'll be clearer how you're defining windows of time.

More generally, it's definitely possible to get Node Manager to do all kinds of goofy things if you give it a quirky configuration. That's something I think it might be useful to check for and at least warn about, but since the target audience for Node Manager is "sysadmins savvy enough to dynamically scale their clouds," it seems low-priority.

>> I understand that the constant naming makes things confusing, though. I probably shouldn't borrow the SLURM state names since that's not what we actually care about. What about NONIDLE? Got a better suggestion?
>
> I suppose the labels should reflect the categories that determine node manager's behavior. IDLE here suggests it's wasted/available (depending on what's in the job queue!) which seems right. If UNKNOWN were split into BOOTING and UNRESPONSIVE, would that cover the cases? Or even BOOTING and UNKNOWN?

… okay, you know what? State flags seem like more trouble than they're worth. Unless you are literally implementing a state machine, it's always a challenge to define a set of states that both captures the information you're interested in and names each state in that brief-yet-descriptive way.

So forget this. I have replaced state() with in_state(), which lets you query the SLURM state more generally, returning None if the Arvados node record is unavailable or stale. This lets us address the issue with practically the same code client-side, but it's hopefully clearer what's going on and this should be more extensible for future development.
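
A rough idea of the resulting interface, as described (an illustrative sketch; the exact signature and call sites live in 39af523, not here):

    # Illustrative sketch of the in_state() idea; parameter names are
    # assumptions, not the exact code in 39af523.
    def in_state(arvados_node, record_stale, *states):
        """Return whether the node's SLURM state is one of *states, or None
        if the Arvados node record is unavailable or stale."""
        if arvados_node is None or record_stale:
            return None
        return arvados_node.get('info', {}).get('slurm_state') in states

    # A caller can then ask exactly the question it cares about, e.g. treat a
    # node as shutdown-eligible only when in_state(node, stale, 'idle') is True.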

> One other minor thing: The names of these methods suggest one is len(the other) but that's not quite true. Would it be clearer to call the second method monitor_count_alive()?

Sure, as alive_monitor_count. Ready for another look at 39af523.

#8

Updated by Tom Clegg about 10 years ago

Yup, this is easier to follow. LGTM.

#9

Updated by Brett Smith about 10 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:8ad92bb9e7950e0bf758716b40764d26ee33802c.
