Bug #13868

closed

[Node manager] Gets into trouble if nodes don't have arvados_node_size tag

Added by Peter Amstutz over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Description

Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: 2018-07-19 14:50:50 ComputeNodeUpdateActor.5af93592d98f[110137] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: Traceback (most recent call last):
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:     subprocess.check_output(cmd)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:   File "/usr/lib/python2.7/dist-packages/subprocess32.py", line 343, in check_output
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:     raise CalledProcessError(retcode, process.args, output=output)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1.
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: scontrol: error: Weight value (9999999000) is greater than 4294967280

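Invalid nodes have a weight of 9999999. In the scontrol command above this shows up as Weight=9999999000, which is greater than the 4294967280 limit that scontrol reports.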

Two problems:

1. We should make the invalid weight smaller.

2. If a node doesn't have the "arvados_node_size" tag, the node size is set to "None" instead of falling back to the regular size.id as it did before.
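
A minimal sketch of how both problems could be guarded against, not the actual change made in 13868-invalid-node-size. The limit 4294967280 is taken from the scontrol error above; the names `INVALID_WEIGHT`, `clamp_weight`, `node_size_id`, and `scontrol_update_cmd`, as well as the placeholder weight value, are hypothetical and do not come from the arvnodeman source.

```python
# Largest Weight value scontrol accepted, per the error message in the log.
SLURM_MAX_WEIGHT = 4294967280

# Hypothetical placeholder weight for invalid nodes, chosen to stay
# well below the limit above.
INVALID_WEIGHT = 999999


def clamp_weight(weight):
    """Keep a weight inside the range 0..SLURM_MAX_WEIGHT."""
    return max(0, min(int(weight), SLURM_MAX_WEIGHT))


def node_size_id(tags, fallback_size_id):
    """Return the arvados_node_size tag, or fallback_size_id when the
    tag is missing, empty, or the literal string "None"."""
    tagged = tags.get("arvados_node_size")
    if tagged in (None, "", "None"):
        return fallback_size_id
    return tagged


def scontrol_update_cmd(nodename, weight, size_id):
    """Build the same argument list shape seen in the log, with a
    clamped weight."""
    return [
        "scontrol", "update",
        "NodeName=" + nodename,
        "Weight=%d" % clamp_weight(weight),
        "Features=instancetype=" + size_id,
    ]
```

With these helpers, a node without the tag keeps the size id supplied by the caller, and the weight passed to scontrol cannot exceed the limit that caused the failure in the log.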


Subtasks 1 (0 open, 1 closed)

Task #13870: Review 13868-invalid-node-size (Resolved, Peter Amstutz, 07/19/2018)