Bug #6157

Updated by Brett Smith almost 9 years ago

h3. background 

 Changing slurm config files, and keeping them synchronized across controller+workers, is a bit painful and can cause race conditions that are annoying to diagnose, so we try to avoid setups where they have to change during normal operation. 

 Hostnames like "fooN" or "computeN", where N is decimal, let you write foo[0-199], foo[000-199], or compute[0-123] in your slurm config files. nodes.ping makes it easy to manage a setup like this: 
 * nodes.ping automatically assigns a slot number to each node record when it sends its first ping, and sets hostname=computeN at the same time (see bug #6156). The node is expected to look at the ping response and change its hostname to the assigned name. 
 * In the API server configuration, you can set @assign_node_hostname@ to a corresponding format string, so that nodes that ping without a hostname get one matching the schema, and @max_compute_nodes@ to make sure it doesn't go over your allocation. 
 * If your API server's max_compute_nodes config matches your slurm config, nodes.ping will not assign hostnames that aren't defined in your slurm config. 
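 A rough sketch of why the numbered scheme lines up with slurm's range syntax. This is illustrative only: @expand_hostlist@ and @assigned_hostnames@ are hypothetical helpers, the format string uses Python syntax rather than whatever syntax @assign_node_hostname@ actually accepts, and slot numbers are assumed to start at 0.

```python
import re


def expand_hostlist(expr):
    """Expand a simple slurm-style range such as 'compute[0-3]' or 'foo[000-199]'.

    Only a single prefix[lo-hi] range is handled here; real slurm
    hostlist expressions allow more than this sketch supports.
    """
    match = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported hostlist expression: {expr!r}")
    prefix, lo, hi = match.groups()
    # Pad to the width of the lower bound, so '000-199' yields foo000..foo199
    # while '0-199' yields foo0..foo199.
    width = len(lo)
    return [prefix + str(n).zfill(width) for n in range(int(lo), int(hi) + 1)]


def assigned_hostnames(fmt, max_compute_nodes):
    """Hostnames for slot numbers 0..max_compute_nodes-1 (assumed numbering)."""
    return [fmt.format(slot_number=n) for n in range(max_compute_nodes)]


# With a limit of 124 nodes, every assigned name falls inside compute[0-123].
names = assigned_hostnames("compute{slot_number}", 124)
missing = set(names) - set(expand_hostlist("compute[0-123]"))
print("hostnames not covered by the slurm range:", sorted(missing))
```

 If the node limit were raised without widening the range in the slurm config, @missing@ would be non-empty, which is the mismatch the matching settings are meant to prevent.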

 *However,* in some setups it might be inconvenient/difficult/impossible to use hostnames like "computeN". 

h3. improvement 

 Install docs should include a section explaining 
 * Why compute[0-N] is a good idea (see above) 
 * What to do differently if you use a naming scheme other than string+decimal (e.g., your worker nodes' hostnames are {alice, bob, clay, ...}) 

 We should make the simplifying assumption that the hostnames are assigned manually/OOB, and known in advance. IOW, instead of covering scenarios where slurm config has to change every time a new compute node is turned up, we should just advise against that. 

 AFAIK(TC), as long as the available/powered-on nodes' hostnames are a subset of the hostnames given in slurm.conf, and no two hosts have the same name, slurm and Arvados should work without any code changes. Even #6156 has a workaround: when you add a node record, set its slot_number at the same time you assign its hostname. 
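
 The two conditions above (subset of the slurm.conf hostnames, no duplicate names) are easy to check mechanically. A minimal sketch, assuming you already have both lists of names in hand; @check_hostnames@ is a hypothetical helper, not part of slurm or Arvados, and the names are the example ones from this ticket.

```python
from collections import Counter


def check_hostnames(active, configured):
    """Return (duplicates, unknown) for a list of active node hostnames.

    duplicates: names used by more than one active node.
    unknown: active names that do not appear in the configured list.
    """
    counts = Counter(active)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    unknown = sorted(set(active) - set(configured))
    return duplicates, unknown


# Names known in advance and listed in slurm.conf (example values).
configured = ["alice", "bob", "clay"]

duplicates, unknown = check_hostnames(["alice", "alice", "dave"], configured)
print("duplicate hostnames:", duplicates)
print("hostnames missing from slurm.conf:", unknown)
```

 Both lists being empty corresponds to the case where no slurm config change should be needed.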
