Bug #13868

[Node manager] Gets into trouble if nodes don't have arvados_node_size tag

Added by Peter Amstutz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 07/19/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release:
Release relationship: Auto

Description

Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: 2018-07-19 14:50:50 ComputeNodeUpdateActor.5af93592d98f[110137] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: Traceback (most recent call last):
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:     subprocess.check_output(cmd)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:   File "/usr/lib/python2.7/dist-packages/subprocess32.py", line 343, in check_output
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]:     raise CalledProcessError(retcode, process.args, output=output)
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute138', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1.
Jul 19 14:50:50 manage.e51c5.arvadosapi.com env[110136]: scontrol: error: Weight value (9999999000) is greater than 4294967280

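Invalid nodes have a weight of 9999999. In the scontrol command above this shows up as Weight=9999999000, which scontrol rejects because it is greater than 4294967280.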

Two problems:

1. The invalid weight is too large for scontrol to accept. We should make it smaller.

2. If a node doesn't have the "arvados_node_size" tag, its size is set to "None" instead of falling back to the regular size.id as it did before.
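
Both problems can be illustrated with a minimal sketch. This is not the code from the 13868-invalid-node-size branch. The helper names (clamp_weight, node_size_id, build_update_cmd) and the sample size names are hypothetical, and the limit is taken from the scontrol error message in the log, assuming 4294967280 itself is accepted.

```python
# Hypothetical sketch only; none of these names come from arvnodeman.

# Limit quoted by scontrol in the log: "Weight value ... is greater than
# 4294967280". Assumption: 4294967280 itself is accepted.
SLURM_MAX_WEIGHT = 4294967280


def clamp_weight(weight):
    """Return an integer weight that stays within the scontrol limit."""
    return max(0, min(int(weight), SLURM_MAX_WEIGHT))


def node_size_id(tags, default_size_id):
    """Prefer the arvados_node_size tag; fall back to the regular size id.

    A missing or empty tag yields default_size_id, never the string "None".
    """
    tagged = tags.get("arvados_node_size")
    return tagged if tagged else default_size_id


def build_update_cmd(nodename, weight, tags, default_size_id):
    """Build an scontrol argument list shaped like the one in the log."""
    return [
        "scontrol", "update",
        "NodeName=" + nodename,
        "Weight=%i" % clamp_weight(weight),
        "Features=instancetype=" + node_size_id(tags, default_size_id),
    ]


if __name__ == "__main__":
    # "example-size" is a placeholder, not a real size id from this cluster.
    print(build_update_cmd("compute138", 9999999000, {}, "example-size"))
```

Clamping at the call site is only one option; choosing a smaller invalid weight at the source, as the first item suggests, avoids the out-of-range value entirely.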


Subtasks

Task #13870: Review 13868-invalid-node-size (Resolved, Peter Amstutz)

Associated revisions

Revision 2f4a5bef
Added by Peter Amstutz over 2 years ago

Merge branch '13868-invalid-node-size' refs #13868

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz over 2 years ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz over 2 years ago

  • Description updated
  • Assigned To set to Peter Amstutz

#5 Updated by Peter Amstutz over 2 years ago

  • Status changed from In Progress to Resolved

#6 Updated by Tom Morris over 2 years ago

  • Release set to 13
