Bug #13868

Updated by Peter Amstutz over 2 years ago

<pre>
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: 2018-07-19 14:50:50 ComputeNodeUpdateActor.5af93592d98f[110137] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: Traceback (most recent call last):
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: subprocess.check_output(cmd)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: File "/usr/lib/python2.7/dist-packages/subprocess32.py", line 343, in check_output
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: raise CalledProcessError(retcode, process.args, output=output)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1.
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: scontrol: error: Weight value (9999999000) is greater than 4294967280
</pre>

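Invalid nodes are assigned a weight of 9999999. The command in the log sends Weight=9999999000, which scontrol rejects as greater than 4294967280.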

Two problems:

We should make the invalid weight smaller, so that the value passed to scontrol stays within the limit of 4294967280 reported in the error.
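One possible way to guard against this, as a minimal sketch only: clamp the weight before building the scontrol command. The names @SLURM_MAX_WEIGHT@, @clamp_weight@ and @scontrol_update_args@ are hypothetical and are not taken from arvnodeman; the limit is the number scontrol printed in the log above.

```python
# Limit taken from the scontrol error message in the log above.
SLURM_MAX_WEIGHT = 4294967280


def clamp_weight(weight, limit=SLURM_MAX_WEIGHT):
    """Return weight as an integer limited to the range [0, limit]."""
    return max(0, min(int(weight), limit))


def scontrol_update_args(node_name, weight, features):
    """Build the argument list for 'scontrol update' with a clamped weight.

    Hypothetical helper for illustration; it only builds the list and
    does not run scontrol.
    """
    return [
        'scontrol', 'update',
        'NodeName=' + node_name,
        'Weight=%d' % clamp_weight(weight),
        'Features=' + features,
    ]
```

With this sketch, the weight from the failing command would be sent as 4294967280 instead of 9999999000. Choosing a smaller constant for invalid nodes, as proposed above, would avoid the need for clamping in the first place.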

If a node doesn't have the "arvados_node_size" tag, the node size is set to "None" instead of falling back to the regular @size.id@ as before.
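The fallback described here could look roughly like the following sketch. It assumes the node's tags are available as a plain dict and that the regular size ID is passed in separately; @instance_type_for@ is a hypothetical helper, not the actual arvnodeman code.

```python
def instance_type_for(tags, size_id):
    """Return the arvados_node_size tag, or size_id if the tag is missing or empty.

    Hypothetical helper: 'tags' is assumed to be a dict of the node's tags
    and 'size_id' the regular size.id value.
    """
    return tags.get('arvados_node_size') or size_id
```

For an untagged node this returns the regular size ID rather than @None@, so the string "None" never ends up in the node's features.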
