Bug #7481

closed

Docker Daemon failure or FUSE problem

Added by Bryan Cosca over 8 years ago. Updated over 8 years ago.

Status:
Duplicate
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

https://workbench.tb05z.arvadosapi.com/collections/80c6a5e6a158508bc58969e93d5348e5+87/tb05z-8i9sb-2vmkv1gm5jvbw8a.log.txt

2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr run-command: caught exception
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr Traceback (most recent call last):
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr File "/tmp/crunch-job/src/crunch_scripts/run-command", line 393, in <module>
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr (pid, status) = os.wait()
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr OSError: [Errno 4] Interrupted system call
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr run-command: the following output files will be saved to keep:
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr run-command: 11411 ./scatter.intervals
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr run-command: 0 ./.scatter.intervals.done
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr run-command: 0 ./24385-200_AH5G7WCCXX_S4_L004_R1_001_markdup.target.intervals.list
2015-10-07_21:18:17 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr run-command: start writing output to keep
2015-10-07_21:18:19 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr time="2015-10-07T21:18:19Z" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/975e98ca99280f647ab2cda7b45eddbad95d5e3dceb3916a0f0d8bc2d4067c4a/wait: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr 2015-10-07 21:18:20 arvados.arvados_fuse[20316] ERROR: Unhandled exception during FUSE operation
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr Traceback (most recent call last):
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 276, in catch_exceptions_wrapper
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr return orig_func(self, *args, **kwargs)
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 461, in forget
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr ent = self.inodes[inode]
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 214, in __getitem__
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr return self._entries[item]
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr KeyError: 47L
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr srun: error: compute1: task 0: Terminated
2015-10-07_21:18:20 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 stderr srun: Force Terminated job step 215.5
2015-10-07_21:18:21 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 child 9378 on compute1.1 exit 15 success=false
2015-10-07_21:18:21 tb05z-8i9sb-2vmkv1gm5jvbw8a 9201 0 failure (#1, permanent) after 126 seconds
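Note on the first traceback above: os.wait() was interrupted by a signal, and on Python 2 that surfaces as OSError: [Errno 4] Interrupted system call instead of being retried automatically (Python only retries EINTR itself from 3.5 onward, per PEP 475). A minimal sketch of the usual retry pattern, for illustration only and not the actual run-command code:

import errno
import os

def wait_for_child():
    # Retry os.wait() when a signal interrupts the system call (EINTR),
    # instead of letting the OSError propagate and abort the task.
    while True:
        try:
            return os.wait()
        except OSError as e:
            if e.errno == errno.EINTR:
                continue
            raise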


Related issues

Related to Arvados - Bug #5956: [Deployment] Docker configuration changes restart Docker on compute nodes, interrupting running jobs (Duplicate, 05/07/2015)
#1

Updated by Ward Vandewege over 8 years ago

Hmm, this was my fault; I had to tweak the docker runit startup script, and pushed that to puppet yesterday evening. That caused a docker restart in the middle of the job:

Oct  7 21:17:23 compute1 puppet-agent[21037]: (/Stage[main]/User::Virtual/Useraccount[nico]/File[/home/nico/.vim/plugin/detect_puppet.vim]) Dependency Group[nico] has failures: true
Oct  7 21:17:23 compute1 puppet-agent[21037]: (/Stage[main]/User::Virtual/Useraccount[nico]/File[/home/nico/.vim/plugin/detect_puppet.vim]) Skipping because of failed dependencies
Oct  7 21:17:30 compute1 puppet-agent[21037]: (/Stage[main]/Apt::Update/Exec[apt_update]/returns) executed successfully
Oct  7 21:17:43 compute1 puppet-agent[21037]: (/Stage[main]/Arvados-compute-dependencies/Package[libcurl4-openssl-dev]/ensure) ensure changed 'purged' to 'present'
Oct  7 21:17:57 compute1 puppet-agent[21037]: (/Stage[main]/Arvados-compute-node/Arvados-compute-node_def[puppet]/Package[libarvados-perl]/ensure) ensure changed 'purged' to 'latest'
Oct  7 21:18:01 compute1 arvados-compute-ping[29004]: Last ping at 2015-10-07T21:18:01.806780000Z
Oct  7 21:18:11 compute1 puppet-agent[21037]: (/Stage[main]/Arvados-compute-node/Arvados-compute-node_def[puppet]/Runit::Service[docker.io]/File[/etc/sv/docker.io/run]/content) content changed '{md5}d71a377f37cb0a62ef320d224edfc845' to '{md5}2b5c2d566f67deb8bbee80f2aba81aec'
Oct  7 21:18:19 compute1 puppet-agent[21037]: (/Stage[main]/Arvados-compute-node/Arvados-compute-node_def[puppet]/Runit::Service[docker.io]/Runit::Service::Enabled[docker.io]/Exec[sv restart docker.io]) Triggered 'refresh' from 1 events
Oct  7 21:18:21 compute1 slurmd[9824]: lllp_distribution jobid [215] implicit auto binding: sockets, dist 1

which caused this failure.

We have a ticket (#5956) to solve this issue permanently; puppet should not be doing things like this while jobs are running.
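For illustration only, one hypothetical way a job runner could tolerate a brief daemon restart is to poll for the Docker control socket (/var/run/docker.sock, from the fatal error above) to come back before giving up. This is just a sketch of that idea, not the fix being tracked in #5956:

import os
import socket
import time

DOCKER_SOCKET = "/var/run/docker.sock"

def wait_for_docker(timeout=120, interval=2):
    # Hypothetical helper: poll until the Docker daemon socket accepts
    # connections again, or give up after `timeout` seconds.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(DOCKER_SOCKET):
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            try:
                s.connect(DOCKER_SOCKET)
                return True
            except socket.error:
                pass
            finally:
                s.close()
        time.sleep(interval)
    return False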

#2

Updated by Brett Smith over 8 years ago

  • Status changed from New to Duplicate