Bug #16834

Updated by Ward Vandewege about 1 year ago

This is *arvados-dispatch-cloud 2.0.3-1*.

I noticed something odd on our _lugli_ grafana graphs:

!20200911-container-yoyo.png!

On lugli, there are 2 containers that are repeatedly cycling from running to non-running state:

<pre>
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process detected","time":"2020-09-11T17:22:24.370207112Z"}
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process detected","time":"2020-09-11T17:22:24.370286849Z"}
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"Reason":"not in queue","level":"info","msg":"killing crunch-run process","time":"2020-09-11T17:22:24.370368549Z"}
Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"Reason":"not in queue","level":"info","msg":"killing crunch-run process","time":"2020-09-11T17:22:24.370395251Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"Signal":15,"level":"info","msg":"sending signal","time":"2020-09-11T17:22:29.370501187Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"Signal":15,"level":"info","msg":"sending signal","time":"2020-09-11T17:22:29.370587727Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process ended","time":"2020-09-11T17:22:29.387498437Z"}
Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process ended","time":"2020-09-11T17:22:29.387750208Z"}
</pre>
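The cycle above (detected → "not in queue" → SIGTERM → ended) repeats every few seconds. To quantify how often the dispatcher is killing each container, a quick log-analysis sketch (assuming journalctl lines in the JSON format shown above; the function name is mine, not part of Arvados):

<pre>
import json
from collections import Counter

def kill_cycle_counts(lines):
    """Count 'killing crunch-run process' events per container UUID in
    arvados-dispatch-cloud journal lines like the ones pasted above."""
    counts = Counter()
    for line in lines:
        # The JSON payload starts at the first '{'; everything before it
        # is the syslog prefix (timestamp, hostname, unit name).
        start = line.find("{")
        if start < 0:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if entry.get("msg") == "killing crunch-run process":
            counts[entry["ContainerUUID"]] += 1
    return counts
</pre>

Feeding it a day of journalctl output would show whether both containers are being killed on every sweep or only intermittently.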

They are in _cancelled_ state according to Workbench.

There are indeed two AWS VMs online, but docker is not running any containers on them:

<pre>
root@ip-10-255-254-74:/home/admin# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
</pre>

and

<pre>
root@ip-10-255-254-95:/home/admin# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
</pre>

A full ps on the latter shows that arv-mount is still there, but the rest of our 'onion' is not:

<pre>
root@ip-10-255-254-95:/home/admin# ps auxwwwwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? S Sep10 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [rcu_gp]
root 4 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [rcu_par_gp]
root 6 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/0:0H-kblockd]
root 8 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [mm_percpu_wq]
root 9 0.0 0.0 0 0 ? S Sep10 0:00 \_ [ksoftirqd/0]
root 10 0.0 0.0 0 0 ? I Sep10 0:00 \_ [rcu_sched]
root 11 0.0 0.0 0 0 ? I Sep10 0:00 \_ [rcu_bh]
root 12 0.0 0.0 0 0 ? S Sep10 0:00 \_ [migration/0]
root 14 0.0 0.0 0 0 ? S Sep10 0:00 \_ [cpuhp/0]
root 15 0.0 0.0 0 0 ? S Sep10 0:00 \_ [cpuhp/1]
root 16 0.0 0.0 0 0 ? S Sep10 0:00 \_ [migration/1]
root 17 0.0 0.0 0 0 ? S Sep10 0:00 \_ [ksoftirqd/1]
root 19 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/1:0H-kblockd]
root 20 0.0 0.0 0 0 ? S Sep10 0:00 \_ [kdevtmpfs]
root 21 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [netns]
root 22 0.0 0.0 0 0 ? S Sep10 0:00 \_ [kauditd]
root 23 0.0 0.0 0 0 ? S Sep10 0:00 \_ [khungtaskd]
root 24 0.0 0.0 0 0 ? S Sep10 0:00 \_ [oom_reaper]
root 25 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [writeback]
root 26 0.0 0.0 0 0 ? S Sep10 0:00 \_ [kcompactd0]
root 27 0.0 0.0 0 0 ? SN Sep10 0:00 \_ [ksmd]
root 28 0.0 0.0 0 0 ? SN Sep10 0:00 \_ [khugepaged]
root 29 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [crypto]
root 30 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kintegrityd]
root 31 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kblockd]
root 32 0.0 0.0 0 0 ? S Sep10 0:00 \_ [watchdogd]
root 36 0.0 0.0 0 0 ? S Sep10 0:38 \_ [kswapd0]
root 52 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kthrotld]
root 53 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [ipv6_addrconf]
root 63 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kstrp]
root 110 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [nvme-wq]
root 111 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [nvme-reset-wq]
root 113 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [ena]
root 114 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [nvme-delete-wq]
root 138 0.0 0.0 0 0 ? I Sep10 0:00 \_ [kworker/u4:4-events_unbound]
root 183 0.0 0.0 0 0 ? I< Sep10 0:01 \_ [kworker/u5:0-kcryptd]
root 185 0.0 0.0 0 0 ? S Sep10 0:00 \_ [jbd2/nvme1n1p1-]
root 186 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [ext4-rsv-conver]
root 230 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/1:1H-kblockd]
root 351 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kworker/0:1H-kblockd]
root 804 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kdmflush]
root 1155 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kdmflush]
root 1159 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kcryptd_io]
root 1160 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [kcryptd]
root 1161 0.0 0.0 0 0 ? S Sep10 0:00 \_ [dmcrypt_write]
root 1164 0.0 0.0 0 0 ? I< Sep10 0:01 \_ [kworker/u5:2-kcryptd]
root 1196 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfsalloc]
root 1199 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs_mru_cache]
root 1202 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-buf/dm-1]
root 1203 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-data/dm-1]
root 1204 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-conv/dm-1]
root 1205 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-cil/dm-1]
root 1206 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-reclaim/dm-]
root 1207 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-log/dm-1]
root 1208 0.0 0.0 0 0 ? I< Sep10 0:00 \_ [xfs-eofblocks/d]
root 1209 0.0 0.0 0 0 ? S Sep10 0:00 \_ [xfsaild/dm-1]
root 3436 0.0 0.0 0 0 ? I Sep10 0:00 \_ [kworker/0:0-events]
root 11630 0.0 0.0 0 0 ? I 15:54 0:00 \_ [kworker/0:1-cgroup_destroy]
root 16205 0.0 0.0 0 0 ? I 17:17 0:00 \_ [kworker/1:2-xfs-cil/dm-1]
root 17289 0.0 0.0 0 0 ? I 17:28 0:00 \_ [kworker/u4:1-events_unbound]
root 17890 0.0 0.0 0 0 ? I 17:29 0:00 \_ [kworker/1:0-events_power_efficient]
root 1 0.0 0.1 104040 7100 ? Ss Sep10 0:02 /sbin/init
root 254 0.0 0.2 47656 10752 ? Ss Sep10 0:13 /lib/systemd/systemd-journald
root 264 0.0 0.0 20120 972 ? Ss Sep10 0:00 /lib/systemd/systemd-udevd
root 384 0.0 0.0 9488 3056 ? Ss Sep10 0:00 /sbin/dhclient -4 -v -i -pf /run/dhclient.ens5.pid -lf /var/lib/dhcp/dhclient.ens5.leases -I -df /var/lib/dhcp/dhclient6.ens5.leases ens5
root 492 0.0 0.0 9488 2100 ? Ss Sep10 0:00 /sbin/dhclient -6 -v -pf /run/dhclient6.ens5.pid -lf /var/lib/dhcp/dhclient6.ens5.leases -I -df /var/lib/dhcp/dhclient.ens5.leases -nw ens5
message+ 584 0.0 0.0 8800 556 ? Ss Sep10 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 586 0.0 0.0 225824 1312 ? Ssl Sep10 0:01 /usr/sbin/rsyslogd -n -iNONE
root 588 0.0 0.0 19520 960 ? Ss Sep10 0:00 /lib/systemd/systemd-logind
root 591 0.0 0.0 10228 1200 ? Ss Sep10 0:00 /usr/sbin/cron -f
_chrony 598 0.0 0.0 4816 368 ? S Sep10 0:00 /usr/sbin/chronyd -F -1
_chrony 599 0.0 0.0 4816 268 ? S Sep10 0:00 \_ /usr/sbin/chronyd -F -1
root 610 0.0 0.0 5612 112 tty1 Ss+ Sep10 0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
root 611 0.0 0.0 7136 128 ttyS0 Ss+ Sep10 0:00 /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220
root 614 0.0 0.0 15852 2036 ? Ss Sep10 0:00 /usr/sbin/sshd -D
root 1043 0.0 0.0 16612 1088 ? Ss Sep10 0:00 \_ sshd: admin [priv]
admin 1062 0.0 0.0 16848 3000 ? S Sep10 0:06 | \_ sshd: admin@notty
root 17232 0.0 0.2 16616 7872 ? Ss 17:27 0:00 \_ sshd: admin [priv]
admin 17238 0.0 0.1 16616 4692 ? S 17:27 0:00 \_ sshd: admin@pts/0
admin 17239 0.0 0.1 9644 4672 pts/0 Ss 17:27 0:00 \_ -bash
root 17252 0.0 0.0 11660 3720 pts/0 S 17:27 0:00 \_ sudo -s
root 17253 0.0 0.0 8716 3668 pts/0 S 17:27 0:00 \_ /bin/bash
root 19266 0.0 0.0 12512 3164 pts/0 R+ 17:41 0:00 \_ ps auxwwwwf
admin 1046 0.0 0.1 21264 4464 ? Ss Sep10 0:00 /lib/systemd/systemd --user
admin 1047 0.0 0.0 104880 2312 ? S Sep10 0:00 \_ (sd-pam)
root 1237 0.0 1.4 1074276 54512 ? Ssl Sep10 1:26 /usr/sbin/dockerd -H fd:// --default-ulimit nofile=10000:10000 --dns 10.255.0.2
root 1244 0.1 0.5 952160 20456 ? Ssl Sep10 2:18 \_ docker-containerd --config /var/run/docker/containerd/containerd.toml --log-level info
root 1477 0.0 0.4 33716 16772 ? Ss Sep10 0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner
root 4570 0.0 10.4 1608588 396860 ? Sl Sep10 0:04 /usr/share/python2.7/dist/python-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.lugli-dz642-lrg004e66zorxld.326060118/keep689459133
</pre>
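The leftover arv-mount process embeds its container UUID in the crunch-run work directory path on its command line, which ties the orphaned mount back to one of the two cycling containers. A small sketch to pull the UUID out of a ps command line (regex is my assumption, based on the path pattern visible above):

<pre>
import re

def container_uuid_from_cmdline(cmdline):
    """Extract the container UUID from a crunch-run work directory path
    like /tmp/crunch-run.lugli-dz642-lrg004e66zorxld.326060118/keep...
    Returns None if no such path appears on the command line."""
    # Arvados UUIDs are 5-5-15 alphanumeric groups; dz642 is the
    # container object type.
    m = re.search(r"/tmp/crunch-run\.([a-z0-9]{5}-dz642-[a-z0-9]{15})\.", cmdline)
    return m.group(1) if m else None
</pre>

Running this over the arv-mount line above yields _lugli-dz642-lrg004e66zorxld_, i.e. the same container the dispatcher keeps re-detecting and killing on 10.255.254.95.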

Back