Bug #11254
Closed: [Node manager] backtrace on node shutdown
Assigned To:
Node Manager
After upgrading c97qk to Xenial, a test job gets queued and never runs. Checking the arvados-node-manager logs, I find many of these:
2017-03-15_15:19:11.28897 2017-03-15 15:19:11 NodeManagerDaemonActor.0da41a4cb360[26201] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x7fb1346919d0>
2017-03-15_15:19:11.28900 Traceback (most recent call last):
2017-03-15_15:19:11.28900   File "/usr/lib/python2.7/dist-packages/arvnodeman/", line 332, in update_server_wishlist
2017-03-15_15:19:11.28903     nodes_wanted = self._nodes_wanted(size)
2017-03-15_15:19:11.28904   File "/usr/lib/python2.7/dist-packages/arvnodeman/", line 290, in _nodes_wanted
2017-03-15_15:19:11.28904     counts = self._state_counts(size)
2017-03-15_15:19:11.28905   File "/usr/lib/python2.7/dist-packages/arvnodeman/", line 256, in _state_counts
2017-03-15_15:19:11.28906     states = self._node_states(size)
2017-03-15_15:19:11.28908   File "/usr/lib/python2.7/dist-packages/arvnodeman/", line 247, in _node_states
2017-03-15_15:19:11.28908     for rec in self.cloud_nodes.nodes.itervalues()
2017-03-15_15:19:11.28909   File "/usr/lib/python2.7/dist-packages/pykka/", line 273, in get_all
2017-03-15_15:19:11.28910     return [future.get(timeout=timeout) for future in futures]
2017-03-15_15:19:11.28911   File "/usr/lib/python2.7/dist-packages/arvnodeman/", line 249, in <genexpr>
2017-03-15_15:19:11.28912     rec.shutdown_actor is None))
2017-03-15_15:19:11.28913 AttributeError: 'NoneType' object has no attribute 'get_state'
I also see that compute nodes are being started and shut down continuously.
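The traceback ends inside a generator expression that filters on `rec.shutdown_actor is None` and then fails because something it calls `get_state` on is `None`. The following is only a minimal sketch of that failure pattern and one possible guard, not the arvnodeman code: `FakeActor`, `NodeRecord`, the `actor` attribute, both helper functions, and the `"shutdown"` / `"unknown"` placeholder states are hypothetical names chosen for illustration.

```python
# Hypothetical stand-ins for illustration; these are not the arvnodeman classes.
class FakeActor:
    def __init__(self, state):
        self._state = state

    def get_state(self):
        return self._state


class NodeRecord:
    def __init__(self, actor=None, shutdown_actor=None):
        self.actor = actor                    # assumed attribute name
        self.shutdown_actor = shutdown_actor  # name taken from the traceback


def node_states_unguarded(records):
    # Filters only on shutdown_actor, so a record whose actor is None
    # raises AttributeError when get_state is looked up.
    return [rec.actor.get_state()
            for rec in records
            if rec.shutdown_actor is None]


def node_states_guarded(records):
    # One possible guard: never call get_state on a missing actor.
    states = []
    for rec in records:
        if rec.shutdown_actor is not None:
            states.append("shutdown")  # assumed placeholder state
        elif rec.actor is None:
            states.append("unknown")   # assumed placeholder state
        else:
            states.append(rec.actor.get_state())
    return states


if __name__ == "__main__":
    records = [NodeRecord(actor=FakeActor("idle")), NodeRecord(actor=None)]
    try:
        node_states_unguarded(records)
    except AttributeError as err:
        print("unguarded:", err)
    print("guarded:", node_states_guarded(records))
```

Whether a record with no actor should be counted as shut down, unknown, or skipped entirely depends on how the daemon uses the state counts; the sketch only shows where the `None` check would sit.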
I checked the migrations on the API host and they are up to date.
Relevant package versions:
- manage host:
ii arvados-node-manager 0.1.20170307182544-2 all The Arvados node manager
ii python-arvados-python-client 0.1.20170309224426-2 all The Arvados Python SDK
ii slurm-client 15.08.7-1build1 amd64 SLURM client side commands
ii slurm-llnl 15.08.7-1build1 all transitional dummy package for slurm-wlm
ii slurm-wlm 15.08.7-1build1 amd64 Simple Linux Utility for Resource Management
ii slurm-wlm-basic-plugins 15.08.7-1build1 amd64 SLURM basic plugins
ii slurmctld 15.08.7-1build1 amd64 SLURM central management daemon
ii slurmd 15.08.7-1build1 amd64 SLURM compute node daemon
- api host:
ii arvados-api-server 0.1.20170314200114.b1aa6c8-7 amd64 Arvados API server - Arvados is a free and open source platform for big data science.
ii arvados-git-httpd 0.1.20170301200434.2db0c3a-1 amd64 Provide authenticated http access to Arvados-hosted git repositories
ii arvados-src 0.1.20170314200114.b1aa6c8-1 all The Arvados source code
ii arvados-ws 0.1.20170301200434.2db0c3a-1 amd64 Arvados Websocket server
ii crunch-dispatch-slurm 0.1.20170301200434.2db0c3a-1 amd64 Dispatch Crunch containers to a SLURM cluster
ii python-arvados-fuse 0.1.20170309202836-2 all The Keep FUSE driver
ii python-arvados-python-client 0.1.20170309224426-2 all The Arvados Python SDK
ii slurm-client 15.08.7-1build1 amd64 SLURM client side commands
ii slurm-llnl 15.08.7-1build1 all transitional dummy package for slurm-wlm
ii slurm-wlm 15.08.7-1build1 amd64 Simple Linux Utility for Resource Management
ii slurm-wlm-basic-plugins 15.08.7-1build1 amd64 SLURM basic plugins
ii slurmctld 15.08.7-1build1 amd64 SLURM central management daemon
ii slurmd 15.08.7-1build1 amd64 SLURM compute node daemon
- compute host:
ii arvados-docker-cleaner 0.1.20161031213850-3 all The Arvados Docker image cleaner
ii arvados-src 0.1.20170314200114.b1aa6c8-1 all The Arvados source code
ii libarvados-perl 0.1.20160218185759.32e3f6e-1 amd64 no description given
ii python-arvados-fuse 0.1.20170309202836-2 all The Keep FUSE driver
ii python-arvados-python-client 0.1.20170309224426-2 all The Arvados Python SDK
ii slurm-client 15.08.7-1build1 amd64 SLURM client side commands
ii slurm-wlm 15.08.7-1build1 amd64 Simple Linux Utility for Resource Management
ii slurm-wlm-basic-plugins 15.08.7-1build1 amd64 SLURM basic plugins
ii slurmctld 15.08.7-1build1 amd64 SLURM central management daemon
ii slurmd 15.08.7-1build1 amd64 SLURM compute node daemon