Bug #11254

closed

[Node manager] backtrace on node shutdown

Added by Javier Bértoli almost 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 03/15/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

After upgrading c97qk to Xenial, I tried to run a test job (https://workbench.c97qk.arvadosapi.com/pipeline_instances/c97qk-d1hrv-gqjk9u23yq22sg8). It gets queued and never runs. Checking the arvados-node-manager logs, I find many of these:

2017-03-15_15:19:11.28897 2017-03-15 15:19:11 NodeManagerDaemonActor.0da41a4cb360[26201] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x7fb1346919d0>
2017-03-15_15:19:11.28900 Traceback (most recent call last):
2017-03-15_15:19:11.28900   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 332, in update_server_wishlist
2017-03-15_15:19:11.28903     nodes_wanted = self._nodes_wanted(size)
2017-03-15_15:19:11.28904   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 290, in _nodes_wanted
2017-03-15_15:19:11.28904     counts = self._state_counts(size)
2017-03-15_15:19:11.28905   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 256, in _state_counts
2017-03-15_15:19:11.28906     states = self._node_states(size)
2017-03-15_15:19:11.28908   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 247, in _node_states
2017-03-15_15:19:11.28908     for rec in self.cloud_nodes.nodes.itervalues()
2017-03-15_15:19:11.28909   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 273, in get_all
2017-03-15_15:19:11.28910     return [future.get(timeout=timeout) for future in futures]
2017-03-15_15:19:11.28911   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 249, in <genexpr>
2017-03-15_15:19:11.28912     rec.shutdown_actor is None))
2017-03-15_15:19:11.28913 AttributeError: 'NoneType' object has no attribute 'get_state'

I also see that compute nodes are continuously being started and shut down.
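
For anyone triaging: the traceback suggests that at least one tracked node record had no actor attached (None) when its state was requested. The following is a minimal sketch of that failure mode and one possible guard. FakeActor, FakeRecord, both functions, and the "unknown" placeholder are hypothetical names for illustration only; this is not the arvnodeman code and not necessarily how the issue was fixed.

```python
class FakeActor(object):
    """Hypothetical stand-in for a per-node monitor actor."""

    def __init__(self, state):
        self._state = state

    def get_state(self):
        return self._state


class FakeRecord(object):
    """Hypothetical stand-in for a tracked cloud node record."""

    def __init__(self, actor=None, shutdown_actor=None):
        self.actor = actor
        self.shutdown_actor = shutdown_actor


def node_states_unguarded(records):
    # Raises AttributeError as soon as one record has actor=None.
    return [rec.actor.get_state() for rec in records
            if rec.shutdown_actor is None]


def node_states_guarded(records, placeholder="unknown"):
    # Assumption: reporting a placeholder state for records without
    # an actor is acceptable to the caller.
    states = []
    for rec in records:
        if rec.shutdown_actor is not None:
            continue  # node is already being shut down; skip it
        if rec.actor is None:
            states.append(placeholder)
        else:
            states.append(rec.actor.get_state())
    return states


records = [FakeRecord(actor=FakeActor("idle")), FakeRecord(actor=None)]

unguarded_error = None
try:
    node_states_unguarded(records)
except AttributeError as error:
    unguarded_error = str(error)

guarded_states = node_states_guarded(records)
```

Whether a placeholder state is the right behaviour depends on how the daemon counts nodes; the sketch only shows that a single actor-less record is enough to abort the whole calculation.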

Checked the migrations on api and they're up to date.

Relevant package versions:

  • manage host:
    ii  arvados-node-manager                0.1.20170307182544-2              all          The Arvados node manager
    ii  python-arvados-python-client        0.1.20170309224426-2              all          The Arvados Python SDK
    ii  slurm-client                        15.08.7-1build1                   amd64        SLURM client side commands
    ii  slurm-llnl                          15.08.7-1build1                   all          transitional dummy package for slurm-wlm
    ii  slurm-wlm                           15.08.7-1build1                   amd64        Simple Linux Utility for Resource Management
    ii  slurm-wlm-basic-plugins             15.08.7-1build1                   amd64        SLURM basic plugins
    ii  slurmctld                           15.08.7-1build1                   amd64        SLURM central management daemon
    ii  slurmd                              15.08.7-1build1                   amd64        SLURM compute node daemon
    
  • api host:
    ii  arvados-api-server                  0.1.20170314200114.b1aa6c8-7      amd64        Arvados API server - Arvados is a free and open source platform for big data science.
    ii  arvados-git-httpd                   0.1.20170301200434.2db0c3a-1      amd64        Provide authenticated http access to Arvados-hosted git repositories
    ii  arvados-src                         0.1.20170314200114.b1aa6c8-1      all          The Arvados source code
    ii  arvados-ws                          0.1.20170301200434.2db0c3a-1      amd64        Arvados Websocket server
    ii  crunch-dispatch-slurm               0.1.20170301200434.2db0c3a-1      amd64        Dispatch Crunch containers to a SLURM cluster
    ii  python-arvados-fuse                 0.1.20170309202836-2              all          The Keep FUSE driver
    ii  python-arvados-python-client        0.1.20170309224426-2              all          The Arvados Python SDK
    ii  slurm-client                        15.08.7-1build1                   amd64        SLURM client side commands
    ii  slurm-llnl                          15.08.7-1build1                   all          transitional dummy package for slurm-wlm
    ii  slurm-wlm                           15.08.7-1build1                   amd64        Simple Linux Utility for Resource Management
    ii  slurm-wlm-basic-plugins             15.08.7-1build1                   amd64        SLURM basic plugins
    ii  slurmctld                           15.08.7-1build1                   amd64        SLURM central management daemon
    ii  slurmd                              15.08.7-1build1                   amd64        SLURM compute node daemon
    
  • compute host:
    ii  arvados-docker-cleaner              0.1.20161031213850-3              all          The Arvados Docker image cleaner
    ii  arvados-src                         0.1.20170314200114.b1aa6c8-1      all          The Arvados source code
    ii  libarvados-perl                     0.1.20160218185759.32e3f6e-1      amd64        no description given
    ii  python-arvados-fuse                 0.1.20170309202836-2              all          The Keep FUSE driver
    ii  python-arvados-python-client        0.1.20170309224426-2              all          The Arvados Python SDK
    ii  slurm-client                        15.08.7-1build1                   amd64        SLURM client side commands
    ii  slurm-wlm                           15.08.7-1build1                   amd64        Simple Linux Utility for Resource Management
    ii  slurm-wlm-basic-plugins             15.08.7-1build1                   amd64        SLURM basic plugins
    ii  slurmctld                           15.08.7-1build1                   amd64        SLURM central management daemon
    ii  slurmd                              15.08.7-1build1                   amd64        SLURM compute node daemon
    

Files

current (242 KB) current arvados-node-manager log file Javier Bértoli, 03/15/2017 03:51 PM

Subtasks 1 (0 open, 1 closed)

Task #11271: Review 11254-nodemanager-no-actor | Resolved | Peter Amstutz | 03/15/2017
