Bug #16834
Updated by Ward Vandewege over 4 years ago
This is *arvados-dispatch-cloud 2.0.3-1*. I noticed something odd on our _lugli_ Grafana graphs: !20200911-container-yoyo.png!

On lugli, there are 2 containers that are going from running to non-running state repeatedly:

<pre>
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process detected","time":"2020-09-11T17:22:24.370207112Z"}
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process detected","time":"2020-09-11T17:22:24.370286849Z"}
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"Reason":"not in queue","level":"info","msg":"killing crunch-run process","time":"2020-09-11T17:22:24.370368549Z"}
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"Reason":"not in queue","level":"info","msg":"killing crunch-run process","time":"2020-09-11T17:22:24.370395251Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"Signal":15,"level":"info","msg":"sending signal","time":"2020-09-11T17:22:29.370501187Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"Signal":15,"level":"info","msg":"sending signal","time":"2020-09-11T17:22:29.370587727Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process ended","time":"2020-09-11T17:22:29.387498437Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process ended","time":"2020-09-11T17:22:29.387750208Z"}
</pre>

They are in _cancelled_ state according to Workbench. There are indeed two AWS VMs online, but docker is running nothing:

<pre>
root@ip-10-255-254-74:/home/admin# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
</pre>

and

<pre>
root@ip-10-255-254-95:/home/admin# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
</pre>
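For reference, the container record and the frequency of the kill loop can be double-checked from the API host. A minimal sketch, assuming the dispatcher logs under a systemd unit named @arvados-dispatch-cloud@ and with @ARVADOS_API_HOST@ / @ARVADOS_API_TOKEN@ set in the environment (uses the standard v1 containers endpoint; @jq@ assumed available):

<pre>
CT=lugli-dz642-lrg004e66zorxld

# Is the container really cancelled, and does anything still hold a lock on it?
curl -s -H "Authorization: Bearer $ARVADOS_API_TOKEN" \
  "https://$ARVADOS_API_HOST/arvados/v1/containers/$CT" | jq '{state, priority, locked_by_uuid}'

# How many times has the dispatcher tried to kill its crunch-run in the last hour?
journalctl -u arvados-dispatch-cloud --since "1 hour ago" \
  | grep "$CT" | grep -c '"msg":"killing crunch-run process"'
</pre>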
A full ps on the latter shows that arv-mount is still there, but the rest of our 'onion' is not:

<pre>
root@ip-10-255-254-95:/home/admin# ps auxwwwwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? S Sep10 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [rcu_gp]
root 4 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [rcu_par_gp]
root 6 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/0:0H-kblockd]
root 8 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [mm_percpu_wq]
root 9 0.0 0.0 0 0 ? S Sep10 0:00 \_ [ksoftirqd/0]
root 10 0.0 0.0 0 0 ? I Sep10 0:00 \_ [rcu_sched]
root 11 0.0 0.0 0 0 ? I Sep10 0:00 \_ [rcu_bh]
root 12 0.0 0.0 0 0 ? S Sep10 0:00 \_ [migration/0]
root 14 0.0 0.0 0 0 ? S Sep10 0:00 \_ [cpuhp/0]
root 15 0.0 0.0 0 0 ? S Sep10 0:00 \_ [cpuhp/1]
root 16 0.0 0.0 0 0 ? S Sep10 0:00 \_ [migration/1]
root 17 0.0 0.0 0 0 ? S Sep10 0:00 \_ [ksoftirqd/1]
root 19 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/1:0H-kblockd]
root 20 0.0 0.0 0 0 ? S Sep10 0:00 \_ [kdevtmpfs]
root 21 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [netns]
root 22 0.0 0.0 0 0 ? S Sep10 0:00 \_ [kauditd]
root 23 0.0 0.0 0 0 ? S Sep10 0:00 \_ [khungtaskd]
root 24 0.0 0.0 0 0 ? S Sep10 0:00 \_ [oom_reaper]
root 25 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [writeback]
root 26 0.0 0.0 0 0 ? S Sep10 0:00 \_ [kcompactd0]
root 27 0.0 0.0 0 0 ? SN Sep10 0:00 \_ [ksmd]
root 28 0.0 0.0 0 0 ? SN Sep10 0:00 \_ [khugepaged]
root 29 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [crypto]
root 30 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kintegrityd]
root 31 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kblockd]
root 32 0.0 0.0 0 0 ? S Sep10 0:00 \_ [watchdogd]
root 36 0.0 0.0 0 0 ? S Sep10 0:38 \_ [kswapd0]
root 52 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kthrotld]
root 53 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [ipv6_addrconf]
root 63 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kstrp]
root 110 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [nvme-wq]
root 111 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [nvme-reset-wq]
root 113 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [ena]
root 114 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [nvme-delete-wq]
root 138 0.0 0.0 0 0 ? I Sep10 0:00 \_ [kworker/u4:4-events_unbound]
root 183 0.0 0.0 0 0 ? I< Sep10 0:01 \_ [kworker/u5:0-kcryptd]
root 185 0.0 0.0 0 0 ? S Sep10 0:00 \_ [jbd2/nvme1n1p1-]
root 186 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [ext4-rsv-conver]
root 230 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/1:1H-kblockd]
root 351 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/0:1H-kblockd]
root 804 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kdmflush]
root 1155 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kdmflush]
root 1159 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kcryptd_io]
root 1160 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kcryptd]
root 1161 0.0 0.0 0 0 ? S Sep10 0:00 \_ [dmcrypt_write]
root 1164 0.0 0.0 0 0 ? I< Sep10 0:01 \_ [kworker/u5:2-kcryptd]
root 1196 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfsalloc]
root 1199 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs_mru_cache]
root 1202 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-buf/dm-1]
root 1203 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-data/dm-1]
root 1204 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-conv/dm-1]
root 1205 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-cil/dm-1]
root 1206 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-reclaim/dm-]
root 1207 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-log/dm-1]
root 1208 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-eofblocks/d]
root 1209 0.0 0.0 0 0 ? S Sep10 0:00 \_ [xfsaild/dm-1]
root 3436 0.0 0.0 0 0 ? I Sep10 0:00 \_ [kworker/0:0-events]
root 11630 0.0 0.0 0 0 ? I 15:54 0:00 \_ [kworker/0:1-cgroup_destroy]
root 16205 0.0 0.0 0 0 ? I 17:17 0:00 \_ [kworker/1:2-xfs-cil/dm-1]
root 17289 0.0 0.0 0 0 ? I 17:28 0:00 \_ [kworker/u4:1-events_unbound]
root 17890 0.0 0.0 0 0 ? I 17:29 0:00 \_ [kworker/1:0-events_power_efficient]
root 1 0.0 0.1 104040 7100 ? Ss Sep10 0:02 /sbin/init
root 254 0.0 0.2 47656 10752 ? Ss Sep10 0:13 /lib/systemd/systemd-journald
root 264 0.0 0.0 20120 972 ? Ss Sep10 0:00 /lib/systemd/systemd-udevd
root 384 0.0 0.0 9488 3056 ? Ss Sep10 0:00 /sbin/dhclient -4 -v -i -pf /run/dhclient.ens5.pid -lf /var/lib/dhcp/dhclient.ens5.leases -I -df /var/lib/dhcp/dhclient6.ens5.leases ens5
root 492 0.0 0.0 9488 2100 ? Ss Sep10 0:00 /sbin/dhclient -6 -v -pf /run/dhclient6.ens5.pid -lf /var/lib/dhcp/dhclient6.ens5.leases -I -df /var/lib/dhcp/dhclient.ens5.leases -nw ens5
message+ 584 0.0 0.0 8800 556 ? Ss Sep10 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 586 0.0 0.0 225824 1312 ? Ssl Sep10 0:01 /usr/sbin/rsyslogd -n -iNONE
root 588 0.0 0.0 19520 960 ? Ss Sep10 0:00 /lib/systemd/systemd-logind
root 591 0.0 0.0 10228 1200 ? Ss Sep10 0:00 /usr/sbin/cron -f
_chrony 598 0.0 0.0 4816 368 ? S Sep10 0:00 /usr/sbin/chronyd -F -1
_chrony 599 0.0 0.0 4816 268 ? S Sep10 0:00 \_ /usr/sbin/chronyd -F -1
root 610 0.0 0.0 5612 112 tty1 Ss+ Sep10 0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
root 611 0.0 0.0 7136 128 ttyS0 Ss+ Sep10 0:00 /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220
root 614 0.0 0.0 15852 2036 ? Ss Sep10 0:00 /usr/sbin/sshd -D
root 1043 0.0 0.0 16612 1088 ? Ss Sep10 0:00 \_ sshd: admin [priv]
admin 1062 0.0 0.0 16848 3000 ? S Sep10 0:06 | \_ sshd: admin@notty
root 17232 0.0 0.2 16616 7872 ? Ss 17:27 0:00 \_ sshd: admin [priv]
admin 17238 0.0 0.1 16616 4692 ? S 17:27 0:00 \_ sshd: admin@pts/0
admin 17239 0.0 0.1 9644 4672 pts/0 Ss 17:27 0:00 \_ -bash
root 17252 0.0 0.0 11660 3720 pts/0 S 17:27 0:00 \_ sudo -s
root 17253 0.0 0.0 8716 3668 pts/0 S 17:27 0:00 \_ /bin/bash
root 19266 0.0 0.0 12512 3164 pts/0 R+ 17:41 0:00 \_ ps auxwwwwf
admin 1046 0.0 0.1 21264 4464 ? Ss Sep10 0:00 /lib/systemd/systemd --user
admin 1047 0.0 0.0 104880 2312 ? S Sep10 0:00 \_ (sd-pam)
root 1237 0.0 1.4 1074276 54512 ? Ssl Sep10 1:26 /usr/sbin/dockerd -H fd:// --default-ulimit nofile=10000:10000 --dns 10.255.0.2
root 1244 0.1 0.5 952160 20456 ? Ssl Sep10 2:18 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml --log-level info
root 1477 0.0 0.4 33716 16772 ? Ss Sep10 0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner
root 4570 0.0 10.4 1608588 396860 ? Sl Sep10 0:04 /usr/share/python2.7/dist/python-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.lugli-dz642-lrg004e66zorxld.326060118/keep689459133
</pre>
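So whatever took down the rest of the onion, the arv-mount that crunch-run started is still running and still holds the FUSE mount under /tmp/crunch-run.lugli-dz642-lrg004e66zorxld.326060118/. A minimal sketch of how one could confirm that and clean it up by hand on the node, as root (assumes the usual procps/fuse tools; untested here, and we may prefer to leave the node untouched for debugging):

<pre>
# Any crunch-run or arv-mount processes left over?
pgrep -af 'crunch-run|arv-mount'

# Is the keep mount still mounted?
mount | grep /tmp/crunch-run

# If so, unmount it and stop the leftover arv-mount (PID 4570 per the ps output above)
fusermount -u /tmp/crunch-run.lugli-dz642-lrg004e66zorxld.326060118/keep689459133
kill 4570
</pre>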