Bug #11531

[API] clean up stale/conflicting dns data from deleted node records

Added by Tom Clegg 8 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
API
Target version:
Start date:
04/20/2017
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

Problem scenario:
  • node compute100 comes up with ip address = 10.2.3.4
  • node record compute100 is deleted
  • node compute2 comes up with ip address = 10.2.3.4
  • now compute100 and compute2 both have DNS records pointing to 10.2.3.4, and slurm is very confused
  • this breaks again every time a new node comes up with ip address 10.2.3.4, until eventually 100 nodes come up and the compute100 conf file finally gets updated
Solution (related cleanup feature):
  • in the "make sure all DNS entries exist" block we run at startup, check and fix other out-of-sync conditions too:
    • read the content of each existing conf file, and run dns_server_update() if it doesn't match the current IP address in the database. Since we only have a template for writing, not for parsing, this can be implemented as a "skip update if existing content is identical" flag passed to dns_server_update().
    • check for extra "#{hostname}.conf" files left over from a previous config where max_compute_nodes was larger than it is now. Start at N=max_compute_nodes; increase N until "#{hostname_for_slot(N)}.conf" does not exist; then count back down, deleting the files, until N < max_compute_nodes or a node actually exists with slot number N. (Deleting in this order avoids situations like "1..128 and 196..256 exist, but 129..195 do not exist", and thereby ensures it's possible to detect excess conf files in a finite number of steps regardless of how unusual the assign_node_hostname config is.)

History

#1 Updated by Tom Morris 4 months ago

  • Target version set to Arvados Future Sprints
