Bug #11254
closed
[Node manager] backtrace on node shutdown
Description
After upgrading c97qk to Xenial, I tried to run a test job (https://workbench.c97qk.arvadosapi.com/pipeline_instances/c97qk-d1hrv-gqjk9u23yq22sg8), but it gets queued and never runs. Checking the arvados-node-manager logs, I find lots of these:
2017-03-15_15:19:11.28897 2017-03-15 15:19:11 NodeManagerDaemonActor.0da41a4cb360[26201] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x7fb1346919d0>
2017-03-15_15:19:11.28900 Traceback (most recent call last):
2017-03-15_15:19:11.28900   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 332, in update_server_wishlist
2017-03-15_15:19:11.28903     nodes_wanted = self._nodes_wanted(size)
2017-03-15_15:19:11.28904   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 290, in _nodes_wanted
2017-03-15_15:19:11.28904     counts = self._state_counts(size)
2017-03-15_15:19:11.28905   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 256, in _state_counts
2017-03-15_15:19:11.28906     states = self._node_states(size)
2017-03-15_15:19:11.28908   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 247, in _node_states
2017-03-15_15:19:11.28908     for rec in self.cloud_nodes.nodes.itervalues()
2017-03-15_15:19:11.28909   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 273, in get_all
2017-03-15_15:19:11.28910     return [future.get(timeout=timeout) for future in futures]
2017-03-15_15:19:11.28911   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 249, in <genexpr>
2017-03-15_15:19:11.28912     rec.shutdown_actor is None))
2017-03-15_15:19:11.28913 AttributeError: 'NoneType' object has no attribute 'get_state'
and see that compute nodes are being spun up and shut down continuously.
Checked the migrations on api and they're up to date.
Relevant package versions:
- manage host:
ii arvados-node-manager 0.1.20170307182544-2 all The Arvados node manager
ii python-arvados-python-client 0.1.20170309224426-2 all The Arvados Python SDK
ii slurm-client 15.08.7-1build1 amd64 SLURM client side commands
ii slurm-llnl 15.08.7-1build1 all transitional dummy package for slurm-wlm
ii slurm-wlm 15.08.7-1build1 amd64 Simple Linux Utility for Resource Management
ii slurm-wlm-basic-plugins 15.08.7-1build1 amd64 SLURM basic plugins
ii slurmctld 15.08.7-1build1 amd64 SLURM central management daemon
ii slurmd 15.08.7-1build1 amd64 SLURM compute node daemon
- api host:
ii arvados-api-server 0.1.20170314200114.b1aa6c8-7 amd64 Arvados API server - Arvados is a free and open source platform for big data science.
ii arvados-git-httpd 0.1.20170301200434.2db0c3a-1 amd64 Provide authenticated http access to Arvados-hosted git repositories
ii arvados-src 0.1.20170314200114.b1aa6c8-1 all The Arvados source code
ii arvados-ws 0.1.20170301200434.2db0c3a-1 amd64 Arvados Websocket server
ii crunch-dispatch-slurm 0.1.20170301200434.2db0c3a-1 amd64 Dispatch Crunch containers to a SLURM cluster
ii python-arvados-fuse 0.1.20170309202836-2 all The Keep FUSE driver
ii python-arvados-python-client 0.1.20170309224426-2 all The Arvados Python SDK
ii slurm-client 15.08.7-1build1 amd64 SLURM client side commands
ii slurm-llnl 15.08.7-1build1 all transitional dummy package for slurm-wlm
ii slurm-wlm 15.08.7-1build1 amd64 Simple Linux Utility for Resource Management
ii slurm-wlm-basic-plugins 15.08.7-1build1 amd64 SLURM basic plugins
ii slurmctld 15.08.7-1build1 amd64 SLURM central management daemon
ii slurmd 15.08.7-1build1 amd64 SLURM compute node daemon
- compute host:
ii arvados-docker-cleaner 0.1.20161031213850-3 all The Arvados Docker image cleaner
ii arvados-src 0.1.20170314200114.b1aa6c8-1 all The Arvados source code
ii libarvados-perl 0.1.20160218185759.32e3f6e-1 amd64 no description given
ii python-arvados-fuse 0.1.20170309202836-2 all The Keep FUSE driver
ii python-arvados-python-client 0.1.20170309224426-2 all The Arvados Python SDK
ii slurm-client 15.08.7-1build1 amd64 SLURM client side commands
ii slurm-wlm 15.08.7-1build1 amd64 Simple Linux Utility for Resource Management
ii slurm-wlm-basic-plugins 15.08.7-1build1 amd64 SLURM basic plugins
ii slurmctld 15.08.7-1build1 amd64 SLURM central management daemon
ii slurmd 15.08.7-1build1 amd64 SLURM compute node daemon
Updated by Peter Amstutz almost 8 years ago
The backtrace looks like an unintended side effect of #10846.
On further research, we determined the underlying reason the job wasn't running was that slurmd wasn't being started on the compute node.
Updated by Javier Bértoli almost 8 years ago
- Status changed from New to Resolved
- Target version set to 2017-03-29 sprint
- % Done changed from 0 to 100
Fixed slurmd on the compute nodes, and it is working OK now.
Updated by Tom Morris almost 8 years ago
- Status changed from Resolved to In Progress
- Assigned To set to Peter Amstutz
- Priority changed from Urgent to Normal
Updated by Peter Amstutz almost 8 years ago
- Subject changed from Node manager errors when trying to run a job in Ubuntu Xenial (can't run jobs) to [Node manager] backtrace on node shutdown
Updated by Tom Clegg almost 8 years ago
11254-nodemanager-no-actor @ 2c69d49 LGTM
...although I'd say the duplicated conditions here were already a bit smelly and are getting worse... it might be easier to follow/maintain like
    for rec in ...:
        if ...:
            if ...:
                states += ['shutdown']
            else:
                proxy_states += [rec.actor.get_state()]
    return states + pykka.get_all(proxy_states)
kinda nitpicky for this bugfix though
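For reference, the loop suggested above could be filled in roughly as follows. This is only a sketch, not the code from the changeset: `NodeRecord`, `ImmediateFuture`, `FakeActorProxy`, and the local `get_all` are hypothetical stand-ins for the arvnodeman records, pykka futures, actor proxies, and `pykka.get_all`. The two conditions elided as `...` are guesses, including the assumption that a record with no actor is counted as 'shutdown'.

```python
from collections import namedtuple

# Hypothetical stand-in for the per-node records kept by the daemon.
NodeRecord = namedtuple("NodeRecord", ["size_id", "actor", "shutdown_actor"])


class ImmediateFuture:
    """Stand-in for a pykka future that already holds its value."""

    def __init__(self, value):
        self._value = value

    def get(self, timeout=None):
        return self._value


class FakeActorProxy:
    """Stand-in for a monitor actor proxy; get_state() returns a future."""

    def __init__(self, state):
        self._state = state

    def get_state(self):
        return ImmediateFuture(self._state)


def get_all(futures, timeout=None):
    """Same shape as pykka.get_all: resolve each future in order."""
    return [future.get(timeout=timeout) for future in futures]


def node_states(records, size_id=None):
    """Collect one state per matching record, never calling get_state() on None."""
    states = []
    proxy_states = []
    for rec in records:
        # Assumed filter: size_id=None means "all sizes".
        if size_id is None or rec.size_id == size_id:
            # Assumed rule: shutting down, or no actor, counts as 'shutdown'.
            if rec.shutdown_actor is not None or rec.actor is None:
                states.append("shutdown")
            else:
                proxy_states.append(rec.actor.get_state())
    return states + get_all(proxy_states)


records = [
    NodeRecord("small", FakeActorProxy("idle"), None),
    NodeRecord("small", None, None),                        # no actor
    NodeRecord("small", FakeActorProxy("busy"), object()),  # being shut down
    NodeRecord("large", FakeActorProxy("idle"), None),
]
print(node_states(records, "small"))
```

With a single branch per record, the "is there an actor to ask?" check lives in one place, so a record whose actor is None can no longer reach `get_state()` the way it did in the backtrace.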
Updated by Peter Amstutz almost 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:996b635700d7270229200a56d2c2b9f7c96a84fb.