Bug #12073

[Node manager] Clean up stale arvados node records

Added by Peter Amstutz 4 months ago. Updated about 1 month ago.

Status:ResolvedStart date:09/20/2017
Priority:NormalDue date:
Assignee:Lucas Di Pentima% Done:

100%

Category:-
Target version:2017-10-11 Sprint
Story points-Remaining (hours)0.00 hour
Velocity based estimate-

Description

If a node goes away without node manager doing clean_arvados_node(), it will remain in the nodes table tying up a slot.

Node manager should detect when there is a node in the nodes table that doesn't correspond to a real cloud node and clean up the stale node record.


Subtasks

Task #12201: Review 12073-nodemanager-stale-nodes-recsResolvedPeter Amstutz

Associated revisions

Revision 1339298d
Added by Lucas Di Pentima about 1 month ago

Merge branch '12073-nodemanager-stale-nodes-recs-bis'
Closes #12073

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Peter Amstutz 4 months ago

  • Description updated (diff)

#2 Updated by Tom Morris 3 months ago

  • Target version set to 2017-08-30 Sprint

#3 Updated by Peter Amstutz 3 months ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint

#4 Updated by Peter Amstutz 3 months ago

  • Target version changed from 2017-09-13 Sprint to 2017-08-30 Sprint

#5 Updated by Tom Morris 3 months ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint

#6 Updated by Lucas Di Pentima 3 months ago

  • Assignee set to Lucas Di Pentima

#7 Updated by Lucas Di Pentima 2 months ago

  • Status changed from New to In Progress

#8 Updated by Lucas Di Pentima 2 months ago

Following a related ticket note: https://dev.arvados.org/issues/12064#note-2

Is it safe to assume that the correct way to detect dangling arvados node records should be the following?:

  • Ask the api client for all node records with a slot_number assigned
  • For every node record, check using its status with sinfo
  • If the status is drain*, reset the node record

This would be implemented on a on_start() method on the ArvadosNodeListMonitorActor class.

#9 Updated by Peter Amstutz 2 months ago

Lucas Di Pentima wrote:

Following a related ticket note: https://dev.arvados.org/issues/12064#note-2

Is it safe to assume that the correct way to detect dangling arvados node records should be the following?:

  • Ask the api client for all node records with a slot_number assigned
  • For every node record, check using its status with sinfo
  • If the status is drain*, reset the node record

This would be implemented on a on_start() method on the ArvadosNodeListMonitorActor class.

This looks about right. One additional consideration is you should make sure the record is "stale" (the node record in the arvados table hasn't been touched for at least node_stale_after seconds, by default this is two hours.)

#10 Updated by Lucas Di Pentima 2 months ago

Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale and won't get cleaned up unless nodemanager get restarted later on.

#11 Updated by Peter Amstutz 2 months ago

Lucas Di Pentima wrote:

Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale and won't get cleaned up unless nodemanager get restarted later on.

So maybe instead of running once on startup, this should be implemented as an Actor which runs every couple of hours and sleeps in between?

#12 Updated by Peter Amstutz 2 months ago

Peter Amstutz wrote:

Lucas Di Pentima wrote:

Is the nodemanager periodically restarted? If some cloud node goes down and the nodemanager immediately restarts before cleaning up its record, that record won't be stale and won't get cleaned up unless nodemanager get restarted later on.

So maybe instead of running once on startup, this should be implemented as an Actor which runs every couple of hours and sleeps in between?

(04:09:07 PM) tetron: lucas: actually, I think the right place might be in daemon.py in update_arvados_nodes
(04:09:32 PM) tetron: that's where it actually has access to the list of both cloud nodes and arvados nodes
(04:09:57 PM) tetron: and can determine if there's an arvados node which is tying up a slot, but doesn't have a corresponding cloud node, and is stale
(04:15:04 PM) lucas: tetron: Ok, I think that method is the one that it's executed when the ArvadosNodeListMonitorActor poller returns the nodelist. I should check those with a crunch_worker_state == 'down' in that case because the 'drain*' state isn't taken as a special case on _send_request().
(04:16:05 PM) tetron: yes
(04:18:39 PM) tetron: yes, if a node record is down, and stale, but has a slot assigned, it needs to be cleaned
(04:20:07 PM) tetron: lucas: you'll want to use something like ComputeNodeUpdateActor to do the actual Arvados request to avoid blocking the daemon thread

#13 Updated by Lucas Di Pentima 2 months ago

  • Target version changed from 2017-09-13 Sprint to 2017-09-27 Sprint

#14 Updated by Lucas Di Pentima 2 months ago

Updates at 0f1bac0cc - branch 12073-nodemanager-stale-nodes-recs
Test run: https://ci.curoverse.com/job/developer-run-tests/463/

  • Added a new actor that takes care of sending the arvados records update requests to the API Server when the NodeManagerDaemonActor decides that some stale node record should be cleaned up.
  • Added related tests.

#15 Updated by Lucas Di Pentima 2 months ago

Re-ran tests to see if they still failed, but they finished ok: https://ci.curoverse.com/job/developer-run-tests/464/

#16 Updated by Peter Amstutz about 1 month ago

            if node['crunch_worker_state'] == 'down' and \
               node['slot_number'] and \
               arvados_node_missing(node, self.node_stale_after):

Should be node['slot_number'] is None, because bool(node['slot_number']) when node['slot_number'] == 0 evaluates to False.

In doc/*.cfg there is a comment about the value of "Node stale time":

# "Node stale time" affects two related behaviors.
# 1. If a compute node has been running for at least this long, but it
# isn't paired with an Arvados node, do not shut it down, but leave it alone.
# This prevents the node manager from shutting down a node that might
# actually be doing work, but is having temporary trouble contacting the
# API server.
# 2. When the Node Manager starts a new compute node, it will try to reuse
# an Arvados node that hasn't been updated for this long.

Part (1.) is wrong (it may have done that at some point, but I'm pretty sure doesn't do that any more.) Then it needs to mention the new behavior of clearing stale node records.

There's a copy of these sample configuration files in arvados/doc/install/install-nodemanager.html.textile.liquid which unfortunately makes six places the comment needs to be updated :-/ Maybe the arvados/doc/_includes could have a relative symlink to services/nodemanager/docs/*.cfg and then use {% include 'azure.example.cfg' %}.

It feels like there is a potential race between _arvados_node_cleaner.clean_node() and ComputeNodeSetupActor. Example: a call to clean_node() is delayed, meanwhile the record is assigned to ComputeNodeSetupActor and paired, then clean_node() comes in later and wipes the node record. The find_stale_node() method should check slot_numer is None to ensure that it doesn't assign a record that might be cleaned later.

On further consideration, the clean_node() process should probably be the main process responsible for cleaning stale node record, find_stale_node() should only pick up cleaned nodes, and ComputeNodeSetupActor shouldn't need to call _clean_arvados_node() at all.

Similarly, the daemon callback method node_finished_shutdown could call clean_node() on a successful shutdown instead of having it be part of the ComputeNodeShutdownActor.

#17 Updated by Peter Amstutz about 1 month ago

So to clarify:

  • A node record should get cleaned by NodeManagerDaemonActor.update_arvados_nodes when modified_at is older than node_stale_time, crunch_worker_state in ('down', None) and any of slot_number, hostname, ip_address, first_ping_at, last_ping_at are not None.
  • The node record also gets cleaned immediately by NodeManagerDaemonActor.node_finished_shutdown on successful shutdown.
  • A stale record is candidate for reuse when modified_at is older than node_stale_time, crunch_worker_state in ('down', None) and all of @slot_number, hostname, ip_address, first_ping_at, last_ping_at are None.
  • ComputeNodeSetupActor and ComputeNodeShutdownActor should not longer clean node records. (ComputeNodeSetupActor may still create a new one if no record is provided to it).

#18 Updated by Tom Morris about 1 month ago

I've also seen this table with hundreds of records with a state of 'startup-fail'. Are these candidates for cleanup too?

#19 Updated by Peter Amstutz about 1 month ago

On further thought, I realize there's a reason the SetupActor performs the record cleaning, which we need to maintain.

Node records effectively have the following states:

  1. Unassigned: node record is available for use
  2. Assigned: node record is reserved to a node that is being created, but hasn't pinged yet
  3. In use: node is assigned a slot and pinging the API server every minute
  4. Not in use: recently shut down, slot is null
  5. Stale: node has not pinged in some time, but is taking up a slot

#20 Updated by Lucas Di Pentima about 1 month ago

Updates at 6ec328c00
Test run: https://ci.curoverse.com/job/developer-run-tests/468/

  • Updated documentation to include nodemanager cloud config files examples from services/nodemanager/doc/ directory.
  • Updated criteria by which an arvados node record is considered stale.
  • Added node record cleanup call on successful cloud node shutdown.
  • Added related test.

#21 Updated by Peter Amstutz about 1 month ago

Sorry, I had to cut the previous thought short.

Currently the "clean" behavior of SetupActor has the effect of touching the Node record (updating modified_at). Because the node record reuse logic picks a record where both modified_at is at least 2 hours old and slot is null, this protects us from assigning the record to another node even if node manager is restarted.

If node creation fails, the node record won't be available to reassign for 2 hours (if a lot of node creation attempts fail, it can result in creating 1000s of extra node records.)

The slot is assigned when the nodes successfully pings.

The "clean" behavior of ShutdownActor is currently responsible for releasing the slot. If node manager is restarted between sending the request to destroy the the node and cleaning the record, then the slot remains non-null and the record can't be reused. Also, because modified_at is updated, even cleaned node records for recently shut down nodes won't be reused for two hours.

Effectively the current node manager treats nodes as having a two-hour lock, and avoids modifying nodes which have been modified more recently than that, with the single exception of nodes that have been explicitly shut down.

So desired node record states are:

  • Cleaned: "clean" state is indicated, safe to assign
  • Assigned: "assigned" state is indicated, recently modified, slot is null
  • In use: "assigned" state is indicated, recently modified, slot is set
  • Stale: "assigned" state is indicated, not recently modified, should be cleaned

The SetupActor should put the node record into the "Assigned" state, the ShutdownActor should put the record back into the "Cleaned" state. We don't have explicit a state field on the Node record, but could define it through some combination of fields.

#22 Updated by Lucas Di Pentima about 1 month ago

  • Target version changed from 2017-09-27 Sprint to 2017-10-11 Sprint

#23 Updated by Lucas Di Pentima about 1 month ago

Updates at 96da34b18 - new branch: 12073-nodemanager-stale-nodes-recs-bis
Test run: https://ci.curoverse.com/job/developer-run-tests/473/

After some thought, I came up with a fairly simple fix, that I believe avoid race conditions and don't try to solve the bigger problem that causes this issue.

I think the main cause of this is that nodemanager only creates or updates node records (never deletes), and because of the 2 hour window that any record has before being considered as stale, there are times when problems make the nodemanager create new records because it doesn't have any stale ones (all modified_at recently, for example) so the node record count grows over time.
To this previous issue we have to add the fact that sometimes a node record isn't cleaned after a node shutdown, and it is kept on the database as stale and with a slot number assigned, mixed with lots of other stale and slot_number-less.

So, what I believe is happening is that find_stale_node() is randomically returning the first (of a lot) stale node, and over time the ones with a slot number are more of a minority against the other kind, until there are no more slots and node creation start to fail.

Quick Solution: I've updated find_stale_node() to sort the node record list before checking for the first stale one, prioritizing those with a slot number assigned.

We'll have to think how to safely remove those stale node records from the api server, but I think this can help avoid the issue reported here.

#24 Updated by Peter Amstutz about 1 month ago

Lucas Di Pentima wrote:

Updates at 96da34b18 - new branch: 12073-nodemanager-stale-nodes-recs-bis
Test run: https://ci.curoverse.com/job/developer-run-tests/473/

After some thought, I came up with a fairly simple fix, that I believe avoid race conditions and don't try to solve the bigger problem that causes this issue.

I think the main cause of this is that nodemanager only creates or updates node records (never deletes), and because of the 2 hour window that any record has before being considered as stale, there are times when problems make the nodemanager create new records because it doesn't have any stale ones (all modified_at recently, for example) so the node record count grows over time.
To this previous issue we have to add the fact that sometimes a node record isn't cleaned after a node shutdown, and it is kept on the database as stale and with a slot number assigned, mixed with lots of other stale and slot_number-less.

So, what I believe is happening is that find_stale_node() is randomically returning the first (of a lot) stale node, and over time the ones with a slot number are more of a minority against the other kind, until there are no more slots and node creation start to fail.

Quick Solution: I've updated find_stale_node() to sort the node record list before checking for the first stale one, prioritizing those with a slot number assigned.

We'll have to think how to safely remove those stale node records from the api server, but I think this can help avoid the issue reported here.

While this doesn't totally avoid the situation of booting nodes that can't be assigned slots (since there is a 2 hour window where a slot might remain tied up), it should mitigate the problem a bit. LGTM @ 96da34b1888638648c291d989d990b4e5738db88

#25 Updated by Anonymous about 1 month ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:1339298d5812df668e08d9e77d595012cffd3171.

Also available in: Atom PDF