Bug #11254

[Node manager] backtrace on node shutdown

Added by Javier Bértoli 5 months ago. Updated 5 months ago.

Status: Resolved
Start date: 03/15/2017
Priority: Normal
Due date:
Assignee: Peter Amstutz
% Done: 100%
Category: Node Manager
Target version: 2017-03-29 sprint
Story points: 0.5
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

After upgrading c97qk to Xenial, I tried to run a test job (https://workbench.c97qk.arvadosapi.com/pipeline_instances/c97qk-d1hrv-gqjk9u23yq22sg8); it gets queued and never runs. Checking the arvados-node-manager logs, I find lots of these:

2017-03-15_15:19:11.28897 2017-03-15 15:19:11 NodeManagerDaemonActor.0da41a4cb360[26201] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x7fb1346919d0>
2017-03-15_15:19:11.28900 Traceback (most recent call last):
2017-03-15_15:19:11.28900   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 332, in update_server_wishlist
2017-03-15_15:19:11.28903     nodes_wanted = self._nodes_wanted(size)
2017-03-15_15:19:11.28904   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 290, in _nodes_wanted
2017-03-15_15:19:11.28904     counts = self._state_counts(size)
2017-03-15_15:19:11.28905   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 256, in _state_counts
2017-03-15_15:19:11.28906     states = self._node_states(size)
2017-03-15_15:19:11.28908   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 247, in _node_states
2017-03-15_15:19:11.28908     for rec in self.cloud_nodes.nodes.itervalues()
2017-03-15_15:19:11.28909   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 273, in get_all
2017-03-15_15:19:11.28910     return [future.get(timeout=timeout) for future in futures]
2017-03-15_15:19:11.28911   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 249, in <genexpr>
2017-03-15_15:19:11.28912     rec.shutdown_actor is None))
2017-03-15_15:19:11.28913 AttributeError: 'NoneType' object has no attribute 'get_state'

and I see that compute nodes are being spun up and down continuously.

I checked the migrations on the API server and they're up to date.
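The backtrace boils down to a None dereference. A minimal, hypothetical reconstruction (stand-in classes, not the real arvnodeman code): `_node_states` evaluates a generator over the cloud-node records and calls `actor.get_state()` on each one, but a record whose node is mid-shutdown can have `actor == None`, which reproduces the `AttributeError` seen in the log:

```python
# Hypothetical stand-ins for arvnodeman's cloud-node records; the real
# daemon keeps one record per cloud node, with rec.actor set to None
# once the node's monitor actor has been stopped.

class Record:
    def __init__(self, actor):
        self.actor = actor  # None after the node's actor goes away

class Actor:
    def get_state(self):
        return 'idle'

records = [Record(Actor()), Record(None)]

err_msg = None
try:
    # Same shape as the failing generator expression in daemon.py:
    # it assumes every record still has a live actor.
    states = [rec.actor.get_state() for rec in records]
except AttributeError as err:
    err_msg = str(err)

print(err_msg)  # 'NoneType' object has no attribute 'get_state'
```

The fix is to filter out (or count as "shutdown") any record whose actor is gone before asking it for its state.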

Relevant package versions:

  • management host:
    ii  arvados-node-manager                0.1.20170307182544-2              all          The Arvados node manager
    ii  python-arvados-python-client        0.1.20170309224426-2              all          The Arvados Python SDK
    ii  slurm-client                        15.08.7-1build1                   amd64        SLURM client side commands
    ii  slurm-llnl                          15.08.7-1build1                   all          transitional dummy package for slurm-wlm
    ii  slurm-wlm                           15.08.7-1build1                   amd64        Simple Linux Utility for Resource Management
    ii  slurm-wlm-basic-plugins             15.08.7-1build1                   amd64        SLURM basic plugins
    ii  slurmctld                           15.08.7-1build1                   amd64        SLURM central management daemon
    ii  slurmd                              15.08.7-1build1                   amd64        SLURM compute node daemon
    
  • api host:
    ii  arvados-api-server                  0.1.20170314200114.b1aa6c8-7      amd64        Arvados API server - Arvados is a free and open source platform for big data science.
    ii  arvados-git-httpd                   0.1.20170301200434.2db0c3a-1      amd64        Provide authenticated http access to Arvados-hosted git repositories
    ii  arvados-src                         0.1.20170314200114.b1aa6c8-1      all          The Arvados source code
    ii  arvados-ws                          0.1.20170301200434.2db0c3a-1      amd64        Arvados Websocket server
    ii  crunch-dispatch-slurm               0.1.20170301200434.2db0c3a-1      amd64        Dispatch Crunch containers to a SLURM cluster
    ii  python-arvados-fuse                 0.1.20170309202836-2              all          The Keep FUSE driver
    ii  python-arvados-python-client        0.1.20170309224426-2              all          The Arvados Python SDK
    ii  slurm-client                        15.08.7-1build1                   amd64        SLURM client side commands
    ii  slurm-llnl                          15.08.7-1build1                   all          transitional dummy package for slurm-wlm
    ii  slurm-wlm                           15.08.7-1build1                   amd64        Simple Linux Utility for Resource Management
    ii  slurm-wlm-basic-plugins             15.08.7-1build1                   amd64        SLURM basic plugins
    ii  slurmctld                           15.08.7-1build1                   amd64        SLURM central management daemon
    ii  slurmd                              15.08.7-1build1                   amd64        SLURM compute node daemon
    
  • compute host:
    ii  arvados-docker-cleaner              0.1.20161031213850-3              all          The Arvados Docker image cleaner
    ii  arvados-src                         0.1.20170314200114.b1aa6c8-1      all          The Arvados source code
    ii  libarvados-perl                     0.1.20160218185759.32e3f6e-1      amd64        no description given
    ii  python-arvados-fuse                 0.1.20170309202836-2              all          The Keep FUSE driver
    ii  python-arvados-python-client        0.1.20170309224426-2              all          The Arvados Python SDK
    ii  slurm-client                        15.08.7-1build1                   amd64        SLURM client side commands
    ii  slurm-wlm                           15.08.7-1build1                   amd64        Simple Linux Utility for Resource Management
    ii  slurm-wlm-basic-plugins             15.08.7-1build1                   amd64        SLURM basic plugins
    ii  slurmctld                           15.08.7-1build1                   amd64        SLURM central management daemon
    ii  slurmd                              15.08.7-1build1                   amd64        SLURM compute node daemon
    

current - arvados-node-manager log file (242 KB) Javier Bértoli, 03/15/2017 03:51 pm


Subtasks

Task #11271: Review 11254-nodemanager-no-actor (Resolved, Peter Amstutz)

Associated revisions

Revision 996b6357
Added by Peter Amstutz 5 months ago

Merge branch '11254-nodemanager-no-actor' closes #11254

History

#1 Updated by Javier Bértoli 5 months ago

#2 Updated by Peter Amstutz 5 months ago

The backtrace looks like an unintended side effect of #10846.

On further research, we determined the underlying reason the job wasn't running was that slurmd wasn't being started on the compute node.

#3 Updated by Javier Bértoli 5 months ago

  • Status changed from New to Resolved
  • Target version set to 2017-03-29 sprint
  • % Done changed from 0 to 100

Fixed slurmd on the compute nodes and it is working OK now.

#4 Updated by Tom Morris 5 months ago

  • Status changed from Resolved to In Progress
  • Assignee set to Peter Amstutz
  • Priority changed from Urgent to Normal

#5 Updated by Peter Amstutz 5 months ago

  • Story points set to 0.5

#6 Updated by Peter Amstutz 5 months ago

  • Subject changed from Node manager errors when trying to run a job in Ubuntu Xenial (can't run jobs) to [Node manager] backtrace on node shutdown

#7 Updated by Tom Clegg 5 months ago

11254-nodemanager-no-actor @ 2c69d49 LGTM

...although I'd say the duplicated conditions here were already a bit smelly and are getting worse... it might be easier to follow/maintain like

for rec in ...:
  if ...:
    if ...:
      states += ['shutdown']
    else:
      proxy_states += [rec.actor.get_state()]
return states + pykka.get_all(proxy_states)

kinda nitpicky for this bugfix though
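The restructuring Tom sketches above can be made concrete. A self-contained sketch with stand-in Future/record classes (hypothetical names, not the real pykka or arvnodeman objects): records without a live actor contribute an immediate 'shutdown' state, all other records contribute a future, and the futures are resolved in one batch at the end:

```python
# Stand-in for a pykka-style future: a value resolved via .get().
class Future:
    def __init__(self, value):
        self._value = value
    def get(self):
        return self._value

def get_all(futures):
    # Mirrors pykka.get_all: turn a list of futures into a list of values.
    return [f.get() for f in futures]

class Actor:
    def __init__(self, state):
        self._state = state
    def get_state(self):
        # Actor proxies return futures, not plain values.
        return Future(self._state)

class Rec:
    def __init__(self, actor=None, shutdown_actor=None):
        self.actor = actor
        self.shutdown_actor = shutdown_actor

def node_states(records):
    states = []
    proxy_states = []
    for rec in records:
        if rec.actor is None or rec.shutdown_actor is not None:
            # No live actor (or already shutting down): count as shutdown
            # instead of dereferencing a None actor.
            states.append('shutdown')
        else:
            proxy_states.append(rec.actor.get_state())
    # Immediate states first, then the batch-resolved actor states.
    return states + get_all(proxy_states)

print(node_states([Rec(actor=Actor('busy')), Rec(actor=None)]))
```

One consequence of this shape is that the returned list groups the 'shutdown' entries ahead of the actor-reported states, which is fine for `_state_counts` since it only tallies them.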

#8 Updated by Peter Amstutz 5 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:996b635700d7270229200a56d2c2b9f7c96a84fb.
