Bug #12073

[Node manager] Clean up stale arvados node records

Added by Peter Amstutz about 1 month ago. Updated 4 days ago.

Status: In Progress            Start date: 09/20/2017
Priority: Normal               Due date:
Assignee: Lucas Di Pentima     % Done: 0%
Category: -
Target version: 2017-09-27 Sprint
Story points: -                Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

If a node goes away without node manager running clean_arvados_node() on it, its record will remain in the nodes table, tying up a slot.

Node manager should detect when there is a node in the nodes table that doesn't correspond to a real cloud node and clean up the stale node record.


Subtasks

Task #12201: Review 12073-nodemanager-stale-nodes-recs (In Progress, Lucas Di Pentima)

History

#1 Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)

#2 Updated by Tom Morris about 1 month ago

  • Target version set to 2017-08-30 Sprint

#3 Updated by Peter Amstutz about 1 month ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint

#4 Updated by Peter Amstutz about 1 month ago

  • Target version changed from 2017-09-13 Sprint to 2017-08-30 Sprint

#5 Updated by Tom Morris 25 days ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint

#6 Updated by Lucas Di Pentima 25 days ago

  • Assignee set to Lucas Di Pentima

#7 Updated by Lucas Di Pentima 18 days ago

  • Status changed from New to In Progress

#8 Updated by Lucas Di Pentima 11 days ago

Following a related ticket note: https://dev.arvados.org/issues/12064#note-2

Is it safe to assume that the correct way to detect dangling Arvados node records is the following?

  • Ask the API client for all node records with a slot_number assigned
  • For every node record, check its status with sinfo
  • If the status is drain*, reset the node record

This would be implemented in an on_start() method of the ArvadosNodeListMonitorActor class.
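
A rough sketch of that detection pass might look like the following (the function name and the sinfo parsing are simplified illustrations, not the proposed implementation; each record found this way would then be reset, e.g. via the existing clean_arvados_node() path):

    import subprocess
    import arvados

    def find_dangling_node_records():
        """Return Arvados node records that hold a slot but whose SLURM state is drain*."""
        api = arvados.api('v1')
        records = api.nodes().list(limit=1000).execute()['items']
        dangling = []
        for node in records:
            if node['slot_number'] is None or not node.get('hostname'):
                continue
            # %T prints the verbose node state, e.g. "drained" / "draining".
            state = subprocess.check_output(
                ['sinfo', '--noheader', '-o', '%T', '-n', node['hostname']])
            if state.decode().strip().startswith('drain'):
                dangling.append(node)
        return dangling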

#9 Updated by Peter Amstutz 11 days ago

Lucas Di Pentima wrote:

Following a related ticket note: https://dev.arvados.org/issues/12064#note-2

Is it safe to assume that the correct way to detect dangling Arvados node records is the following?

  • Ask the API client for all node records with a slot_number assigned
  • For every node record, check its status with sinfo
  • If the status is drain*, reset the node record

This would be implemented in an on_start() method of the ArvadosNodeListMonitorActor class.

This looks about right. One additional consideration: you should make sure the record is "stale" (the node record in the arvados table hasn't been touched for at least node_stale_after seconds; by default this is two hours).
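
For illustration, a minimal staleness test along those lines (node_record_is_stale() is a hypothetical helper, not existing Node Manager code; it assumes modified_at is an ISO 8601 UTC timestamp):

    import calendar
    import time
    from dateutil import parser as dateparser  # third-party python-dateutil

    NODE_STALE_AFTER = 2 * 60 * 60  # default node_stale_after: two hours

    def node_record_is_stale(node_record, stale_after=NODE_STALE_AFTER):
        """True if the Arvados node record hasn't been modified for stale_after seconds."""
        modified = calendar.timegm(
            dateparser.parse(node_record['modified_at']).utctimetuple())
        return (time.time() - modified) > stale_after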

#10 Updated by Lucas Di Pentima 11 days ago

Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale yet and won't get cleaned up unless the nodemanager gets restarted again later on.

#11 Updated by Peter Amstutz 11 days ago

Lucas Di Pentima wrote:

Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale yet and won't get cleaned up unless the nodemanager gets restarted again later on.

So maybe instead of running once on startup, this should be implemented as an Actor which runs every couple of hours and sleeps in between?

#12 Updated by Peter Amstutz 11 days ago

Peter Amstutz wrote:

Lucas Di Pentima wrote:

Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale yet and won't get cleaned up unless the nodemanager gets restarted again later on.

So maybe instead of running once on startup, this should be implemented as an Actor which runs every couple of hours and sleeps in between?

(04:09:07 PM) tetron: lucas: actually, I think the right place might be in daemon.py in update_arvados_nodes
(04:09:32 PM) tetron: that's where it actually has access to the list of both cloud nodes and arvados nodes
(04:09:57 PM) tetron: and can determine if there's an arvados node which is tying up a slot, but doesn't have a corresponding cloud node, and is stale
(04:15:04 PM) lucas: tetron: Ok, I think that method is the one that's executed when the ArvadosNodeListMonitorActor poller returns the nodelist. I should check the ones with crunch_worker_state == 'down' in that case, because the 'drain*' state isn't treated as a special case in _send_request().
(04:16:05 PM) tetron: yes
(04:18:39 PM) tetron: yes, if a node record is down, and stale, but has a slot assigned, it needs to be cleaned
(04:20:07 PM) tetron: lucas: you'll want to use something like ComputeNodeUpdateActor to do the actual Arvados request to avoid blocking the daemon thread
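
A rough sketch of the check described in this exchange (names are illustrative, not the actual daemon.py code; is_stale stands for whatever staleness test is used, e.g. the node_record_is_stale() sketch above):

    def stale_records_holding_slots(arvados_nodelist, is_stale):
        """Yield Arvados node records that are down, stale, and still hold a slot."""
        for arv_node in arvados_nodelist:
            if (arv_node['crunch_worker_state'] == 'down'
                    and arv_node['slot_number'] is not None
                    and is_stale(arv_node)):
                yield arv_node

In update_arvados_nodes(), each record yielded here would be handed off to an update actor (something like ComputeNodeUpdateActor) so the API request doesn't block the daemon thread.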

#13 Updated by Lucas Di Pentima 10 days ago

  • Target version changed from 2017-09-13 Sprint to 2017-09-27 Sprint

#14 Updated by Lucas Di Pentima 5 days ago

Updates at 0f1bac0cc - branch 12073-nodemanager-stale-nodes-recs
Test run: https://ci.curoverse.com/job/developer-run-tests/463/

  • Added a new actor that takes care of sending the Arvados node record update requests to the API server when the NodeManagerDaemonActor decides that a stale node record should be cleaned up (a minimal sketch of the idea is shown below).
  • Added related tests.
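
A minimal pykka-style sketch of such an actor (illustrative only, not the branch code; the exact update body is schematic):

    import pykka

    class ArvadosNodeCleanerActor(pykka.ThreadingActor):
        """Performs the node-record reset off the daemon thread."""
        def __init__(self, arvados_client):
            super(ArvadosNodeCleanerActor, self).__init__()
            self._arv = arvados_client

        def clean_node(self, node_record):
            # Clear the pairing fields so the slot can be reused.
            self._arv.nodes().update(uuid=node_record['uuid'], body={
                'hostname': None, 'ip_address': None, 'slot_number': None,
                'first_ping_at': None, 'last_ping_at': None,
            }).execute()

    # Usage: cleaner = ArvadosNodeCleanerActor.start(arvados.api('v1')).proxy()
    #        cleaner.clean_node(record)  # proxy call returns a pykka future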

#15 Updated by Lucas Di Pentima 4 days ago

Re-ran tests to see if they still failed, but they finished ok: https://ci.curoverse.com/job/developer-run-tests/464/

#16 Updated by Peter Amstutz 4 days ago

            if node['crunch_worker_state'] == 'down' and \
               node['slot_number'] and \
               arvados_node_missing(node, self.node_stale_after):

Should be node['slot_number'] is not None, because bool(node['slot_number']) evaluates to False when node['slot_number'] == 0.
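
That is, the check would read:

            if node['crunch_worker_state'] == 'down' and \
               node['slot_number'] is not None and \
               arvados_node_missing(node, self.node_stale_after):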

In doc/*.cfg there is a comment about the value of "Node stale time":

# "Node stale time" affects two related behaviors.
# 1. If a compute node has been running for at least this long, but it
# isn't paired with an Arvados node, do not shut it down, but leave it alone.
# This prevents the node manager from shutting down a node that might
# actually be doing work, but is having temporary trouble contacting the
# API server.
# 2. When the Node Manager starts a new compute node, it will try to reuse
# an Arvados node that hasn't been updated for this long.

Part (1.) is wrong (it may have done that at some point, but I'm pretty sure it doesn't do that any more). The comment also needs to mention the new behavior of clearing stale node records.

There's a copy of these sample configuration files in arvados/doc/install/install-nodemanager.html.textile.liquid, which unfortunately makes six places where the comment needs to be updated :-/ Maybe arvados/doc/_includes could have a relative symlink to services/nodemanager/docs/*.cfg and then use {% include 'azure.example.cfg' %}.

It feels like there is a potential race between _arvados_node_cleaner.clean_node() and ComputeNodeSetupActor. Example: a call to clean_node() is delayed; meanwhile the record is assigned to ComputeNodeSetupActor and paired; then clean_node() comes in later and wipes the node record. The find_stale_node() method should check that slot_number is None to ensure that it doesn't assign a record that might be cleaned later.

On further consideration, the clean_node() process should probably be the main one responsible for cleaning stale node records, find_stale_node() should only pick up already-cleaned records, and ComputeNodeSetupActor shouldn't need to call _clean_arvados_node() at all.

Similarly, the daemon callback method node_finished_shutdown could call clean_node() on a successful shutdown instead of having it be part of the ComputeNodeShutdownActor.
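
A sketch of that last idea, with illustrative names (the actual daemon callback and cleaner actor may differ):

    def on_node_finished_shutdown(cleaner_actor, arvados_node, shutdown_ok):
        """What the daemon's shutdown callback could do once a shutdown completes."""
        if shutdown_ok and arvados_node is not None:
            # Clean the record right away instead of waiting for it to go stale.
            cleaner_actor.clean_node(arvados_node)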

#17 Updated by Peter Amstutz 4 days ago

So to clarify (a rough code sketch of these rules follows the list):

  • A node record should get cleaned by NodeManagerDaemonActor.update_arvados_nodes when modified_at is older than node_stale_time, crunch_worker_state is in ('down', None), and any of slot_number, hostname, ip_address, first_ping_at, last_ping_at are not None.
  • The node record also gets cleaned immediately by NodeManagerDaemonActor.node_finished_shutdown on successful shutdown.
  • A stale record is a candidate for reuse when modified_at is older than node_stale_time, crunch_worker_state is in ('down', None), and all of slot_number, hostname, ip_address, first_ping_at, last_ping_at are None.
  • ComputeNodeSetupActor and ComputeNodeShutdownActor should no longer clean node records. (ComputeNodeSetupActor may still create a new one if no record is provided to it.)
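
A compact reading of those rules (illustrative names; is_stale stands for the modified_at vs. node_stale_time test):

    PAIRING_FIELDS = ('slot_number', 'hostname', 'ip_address',
                      'first_ping_at', 'last_ping_at')

    def record_needs_cleaning(node, is_stale):
        return (is_stale(node)
                and node['crunch_worker_state'] in ('down', None)
                and any(node[f] is not None for f in PAIRING_FIELDS))

    def record_reusable(node, is_stale):
        return (is_stale(node)
                and node['crunch_worker_state'] in ('down', None)
                and all(node[f] is None for f in PAIRING_FIELDS))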

#18 Updated by Tom Morris 4 days ago

I've also seen this table with hundreds of records with a state of 'startup-fail'. Are these candidates for cleanup too?
