Bug #11531


[API] clean up stale/conflicting dns data from deleted node records

Added by Tom Clegg over 5 years ago. Updated over 1 year ago.


Problem scenario:
  • node compute100 comes up and is assigned an IP address
  • node record compute100 is deleted
  • node compute2 comes up and is assigned the same IP address
  • now compute100 and compute2 both have DNS records pointing to the same IP address, and slurm is very confused
  • any time a new node comes up and reuses an IP address, things will break ... until eventually 100 nodes come up and the compute100 conf file finally gets updated
Solution: add a related cleanup feature:
  • in the "make sure all DNS entries exist" block we run at startup, check and fix other out-of-sync conditions too:
    • read the content of each existing conf file, and run dns_server_update() if it doesn't match the current IP address in the database. We only have a template for writing, not for parsing, so this can be implemented as a "skip update if existing content is identical" flag passed to dns_server_update().
    • check for extra "#{hostname}.conf" files left over from a previous config where max_compute_nodes was larger than it is now:
      • start at N=max_compute_nodes; increase N until "#{hostname_for_slot(N)}.conf" does not exist
      • then count back down, deleting the files, until N < max_compute_nodes or a node actually exists with slot number N
      • deleting in this order avoids situations like "1..128 and 196..256 exist, but 129..195 do not exist", and thereby ensures it's possible to detect excess conf files in a finite number of steps regardless of how unusual the assign_node_hostname config is
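The "skip update if existing content is identical" flag from the first bullet might look like the following minimal Ruby sketch. The conf directory, the ERB template, and the method's exact signature are assumptions for illustration; the real template and file locations live in the API server's configuration.

```ruby
require 'erb'

# Stand-in template; the real one is configured in the API server.
TEMPLATE = ERB.new("host=<%= hostname %> ip=<%= ip_address %>\n")

# Render the conf content for a host and rewrite the file only when the
# on-disk content differs from what the database says it should be.
# Returns true if the file was (re)written, false if it was already in sync.
def dns_server_update(hostname, ip_address, conf_dir:, skip_if_identical: true)
  conf = File.join(conf_dir, "#{hostname}.conf")
  new_content = TEMPLATE.result(binding)
  if skip_if_identical && File.exist?(conf) && File.read(conf) == new_content
    return false  # existing content is identical; skip the update
  end
  File.write(conf, new_content)
  true
end
```

Comparing rendered output against the file avoids needing a parser for the conf format, which is the point of the flag: we only have a template for writing, not for reading back.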
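The count-up-then-count-down scan in the second bullet could be sketched as below. Here hostname_for_slot and the slot-occupancy check are passed in as lambdas, since the real lookups live in the API server's config and Node model; the function name and parameters are assumptions.

```ruby
# Delete "#{hostname}.conf" files for slots at or above max_compute_nodes,
# using the scan order described in the ticket.
def purge_excess_conf_files(conf_dir, max_compute_nodes, hostname_for_slot, slot_in_use)
  conf_path = ->(n) { File.join(conf_dir, "#{hostname_for_slot.call(n)}.conf") }
  # Walk upward from max_compute_nodes until a slot's conf file is missing.
  # Because deletion happens from the top down, no files can exist above
  # the first gap, so the scan terminates in a finite number of steps.
  n = max_compute_nodes
  n += 1 while File.exist?(conf_path.call(n))
  # Walk back down, deleting leftover conf files, stopping at the configured
  # limit or at a slot that is still occupied by a real node.
  loop do
    n -= 1
    break if n < max_compute_nodes || slot_in_use.call(n)
    File.delete(conf_path.call(n))
  end
end
```

Deleting from the top down is what keeps the invariant "no conf files above the first gap" true across runs, which is why the upward scan can stop at the first missing file.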
#1

Updated by Tom Morris over 5 years ago

  • Target version set to Arvados Future Sprints
#2

Updated by Peter Amstutz over 1 year ago

  • Target version deleted (Arvados Future Sprints)
