Bug #13868
[Node manager] Gets into trouble if nodes don't have arvados_node_size tag
Status: closed
Story points: -
Release:
Release relationship: Auto
Description
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: 2018-07-19 14:50:50 ComputeNodeUpdateActor.5af93592d98f[110137] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: Traceback (most recent call last):
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:     subprocess.check_output(cmd)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:   File "/usr/lib/python2.7/dist-packages/subprocess32.py", line 343, in check_output
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:     raise CalledProcessError(retcode, process.args, output=output)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1.
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: scontrol: error: Weight value (9999999000) is greater than 4294967280
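Invalid nodes have a weight of 9999999. The value passed to scontrol in the log above is 9999999000, which exceeds the maximum of 4294967280 that scontrol reports.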
Two problems:
1. We should make the invalid weight smaller, so that it stays within the limit scontrol accepts.
2. If a node doesn't have the "arvados_node_size" tag, its size is set to "None" instead of falling back to the regular size.id as before.
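
A minimal sketch of how the two problems above could be handled, not the actual Node Manager change. All names here (SLURM_MAX_WEIGHT, INVALID_WEIGHT, resolve_size_id, scontrol_update_args, and the known_weights mapping) are hypothetical and do not come from arvnodeman; the only values taken from this report are the 4294967280 limit from the scontrol error and the shape of the scontrol command in the log.

```python
# Largest Weight that scontrol accepted, per the error message in the log above.
SLURM_MAX_WEIGHT = 4294967280

# Hypothetical sentinel for nodes whose size is unknown: large, but within the limit.
INVALID_WEIGHT = SLURM_MAX_WEIGHT


def resolve_size_id(tags, fallback_size_id):
    """Return the arvados_node_size tag, or fallback_size_id if the tag is missing or empty."""
    tagged = tags.get("arvados_node_size")
    return tagged if tagged else fallback_size_id


def scontrol_update_args(node_name, size_id, known_weights):
    """Build the scontrol argument list, keeping Weight within SLURM_MAX_WEIGHT.

    known_weights is an assumed mapping from size id to the weight for that size.
    """
    weight = known_weights.get(size_id)
    if weight is None:
        # Unknown size: mark the node invalid with a weight scontrol will accept.
        weight, feature = INVALID_WEIGHT, "invalid"
    else:
        weight, feature = min(weight, SLURM_MAX_WEIGHT), size_id
    return [
        "scontrol", "update",
        "NodeName=" + node_name,
        "Weight=%d" % weight,
        "Features=instancetype=" + feature,
    ]
```

With this sketch, a node without the tag resolves to whatever size id the caller supplies as the fallback, and a node with an unrecognized size still produces a Weight that scontrol does not reject.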