Bug #16834

Updated by Ward Vandewege over 3 years ago

This is *arvados-dispatch-cloud 2.0.3-1*.

 On lugli, two containers are repeatedly cycling between running and non-running states: 

 <pre> 
 Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process detected","time":"2020-09-11T17:22:24.370207112Z"} 
 Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process detected","time":"2020-09-11T17:22:24.370286849Z"} 
 Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"Reason":"not in queue","level":"info","msg":"killing crunch-run process","time":"2020-09-11T17:22:24.370368549Z"} 
 Sep 11 17:22:24 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"Reason":"not in queue","level":"info","msg":"killing crunch-run process","time":"2020-09-11T17:22:24.370395251Z"} 
 Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"Signal":15,"level":"info","msg":"sending signal","time":"2020-09-11T17:22:29.370501187Z"} 
 Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"Signal":15,"level":"info","msg":"sending signal","time":"2020-09-11T17:22:29.370587727Z"} 
 Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.95","ContainerUUID":"lugli-dz642-lrg004e66zorxld","Instance":"i-0fb85bb32ca92f0c5","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process ended","time":"2020-09-11T17:22:29.387498437Z"} 
 Sep 11 17:22:29 api.lugli.arvadosapi.com arvados-dispatch-cloud[582]: {"Address":"10.255.254.74","ContainerUUID":"lugli-dz642-n2gnjy6o3prkcam","Instance":"i-02cb2a6d6f6e33895","InstanceType":"c5large","PID":582,"level":"info","msg":"crunch-run process ended","time":"2020-09-11T17:22:29.387750208Z"} 
 </pre> 
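To make the cycle easier to see, the journald JSON payloads can be grouped per container. A minimal sketch (using abbreviated copies of the log lines above, keeping only the fields relevant here):

```python
import json

# JSON payloads from the journald lines above, trimmed to the relevant fields.
lines = [
    '{"ContainerUUID":"lugli-dz642-lrg004e66zorxld","msg":"crunch-run process detected","time":"2020-09-11T17:22:24.370207112Z"}',
    '{"ContainerUUID":"lugli-dz642-lrg004e66zorxld","Reason":"not in queue","msg":"killing crunch-run process","time":"2020-09-11T17:22:24.370368549Z"}',
    '{"ContainerUUID":"lugli-dz642-lrg004e66zorxld","Signal":15,"msg":"sending signal","time":"2020-09-11T17:22:29.370501187Z"}',
    '{"ContainerUUID":"lugli-dz642-lrg004e66zorxld","msg":"crunch-run process ended","time":"2020-09-11T17:22:29.387498437Z"}',
]

# Group messages by container UUID to expose the detected -> killed -> ended loop.
events = {}
for line in lines:
    rec = json.loads(line)
    events.setdefault(rec["ContainerUUID"], []).append(rec["msg"])

for uuid, msgs in events.items():
    print(uuid, "->", " / ".join(msgs))
# → lugli-dz642-lrg004e66zorxld -> crunch-run process detected / killing crunch-run process / sending signal / crunch-run process ended
```

The same four-message sequence repeats on every dispatcher poll, which is what the "repeatedly" above refers to.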

 They are in _cancelled_ state according to Workbench. 

 There are indeed two AWS VMs online, but Docker is not running any containers on either of them: 

 <pre> 
 root@ip-10-255-254-74:/home/admin# docker ps 
 CONTAINER ID          IMAGE                 COMMAND               CREATED               STATUS                PORTS                 NAMES 
 </pre> 

 and 

 <pre> 
 root@ip-10-255-254-95:/home/admin# docker ps 
 CONTAINER ID          IMAGE                 COMMAND               CREATED               STATUS                PORTS                 NAMES 
 </pre> 

 A full ps on the latter shows that arv-mount is still there, but the rest of our 'onion' is not: 

 <pre> 
 root@ip-10-255-254-95:/home/admin# ps auxwwwwf 
 USER         PID %CPU %MEM      VSZ     RSS TTY        STAT START     TIME COMMAND 
 root           2    0.0    0.0        0       0 ?          S      Sep10     0:00 [kthreadd] 
 root           3    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [rcu_gp] 
 root           4    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [rcu_par_gp] 
 root           6    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kworker/0:0H-kblockd] 
 root           8    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [mm_percpu_wq] 
 root           9    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [ksoftirqd/0] 
 root          10    0.0    0.0        0       0 ?          I      Sep10     0:00    \_ [rcu_sched] 
 root          11    0.0    0.0        0       0 ?          I      Sep10     0:00    \_ [rcu_bh] 
 root          12    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [migration/0] 
 root          14    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [cpuhp/0] 
 root          15    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [cpuhp/1] 
 root          16    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [migration/1] 
 root          17    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [ksoftirqd/1] 
 root          19    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kworker/1:0H-kblockd] 
 root          20    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [kdevtmpfs] 
 root          21    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [netns] 
 root          22    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [kauditd] 
 root          23    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [khungtaskd] 
 root          24    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [oom_reaper] 
 root          25    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [writeback] 
 root          26    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [kcompactd0] 
 root          27    0.0    0.0        0       0 ?          SN     Sep10     0:00    \_ [ksmd] 
 root          28    0.0    0.0        0       0 ?          SN     Sep10     0:00    \_ [khugepaged] 
 root          29    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [crypto] 
 root          30    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kintegrityd] 
 root          31    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kblockd] 
 root          32    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [watchdogd] 
 root          36    0.0    0.0        0       0 ?          S      Sep10     0:38    \_ [kswapd0] 
 root          52    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kthrotld] 
 root          53    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [ipv6_addrconf] 
 root          63    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kstrp] 
 root         110    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [nvme-wq] 
 root         111    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [nvme-reset-wq] 
 root         113    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [ena] 
 root         114    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [nvme-delete-wq] 
 root         138    0.0    0.0        0       0 ?          I      Sep10     0:00    \_ [kworker/u4:4-events_unbound] 
 root         183    0.0    0.0        0       0 ?          I<     Sep10     0:01    \_ [kworker/u5:0-kcryptd] 
 root         185    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [jbd2/nvme1n1p1-] 
 root         186    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [ext4-rsv-conver] 
 root         230    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kworker/1:1H-kblockd] 
 root         351    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kworker/0:1H-kblockd] 
 root         804    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kdmflush] 
 root        1155    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kdmflush] 
 root        1159    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kcryptd_io] 
 root        1160    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [kcryptd] 
 root        1161    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [dmcrypt_write] 
 root        1164    0.0    0.0        0       0 ?          I<     Sep10     0:01    \_ [kworker/u5:2-kcryptd] 
 root        1196    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfsalloc] 
 root        1199    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs_mru_cache] 
 root        1202    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs-buf/dm-1] 
 root        1203    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs-data/dm-1] 
 root        1204    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs-conv/dm-1] 
 root        1205    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs-cil/dm-1] 
 root        1206    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs-reclaim/dm-] 
 root        1207    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs-log/dm-1] 
 root        1208    0.0    0.0        0       0 ?          I<     Sep10     0:00    \_ [xfs-eofblocks/d] 
 root        1209    0.0    0.0        0       0 ?          S      Sep10     0:00    \_ [xfsaild/dm-1] 
 root        3436    0.0    0.0        0       0 ?          I      Sep10     0:00    \_ [kworker/0:0-events] 
 root       11630    0.0    0.0        0       0 ?          I      15:54     0:00    \_ [kworker/0:1-cgroup_destroy] 
 root       16205    0.0    0.0        0       0 ?          I      17:17     0:00    \_ [kworker/1:2-xfs-cil/dm-1] 
 root       17289    0.0    0.0        0       0 ?          I      17:28     0:00    \_ [kworker/u4:1-events_unbound] 
 root       17890    0.0    0.0        0       0 ?          I      17:29     0:00    \_ [kworker/1:0-events_power_efficient] 
 root           1    0.0    0.1 104040    7100 ?          Ss     Sep10     0:02 /sbin/init 
 root         254    0.0    0.2    47656 10752 ?          Ss     Sep10     0:13 /lib/systemd/systemd-journald 
 root         264    0.0    0.0    20120     972 ?          Ss     Sep10     0:00 /lib/systemd/systemd-udevd 
 root         384    0.0    0.0     9488    3056 ?          Ss     Sep10     0:00 /sbin/dhclient -4 -v -i -pf /run/dhclient.ens5.pid -lf /var/lib/dhcp/dhclient.ens5.leases -I -df /var/lib/dhcp/dhclient6.ens5.leases ens5 
 root         492    0.0    0.0     9488    2100 ?          Ss     Sep10     0:00 /sbin/dhclient -6 -v -pf /run/dhclient6.ens5.pid -lf /var/lib/dhcp/dhclient6.ens5.leases -I -df /var/lib/dhcp/dhclient.ens5.leases -nw ens5 
 message+     584    0.0    0.0     8800     556 ?          Ss     Sep10     0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only 
 root         586    0.0    0.0 225824    1312 ?          Ssl    Sep10     0:01 /usr/sbin/rsyslogd -n -iNONE 
 root         588    0.0    0.0    19520     960 ?          Ss     Sep10     0:00 /lib/systemd/systemd-logind 
 root         591    0.0    0.0    10228    1200 ?          Ss     Sep10     0:00 /usr/sbin/cron -f 
 _chrony      598    0.0    0.0     4816     368 ?          S      Sep10     0:00 /usr/sbin/chronyd -F -1 
 _chrony      599    0.0    0.0     4816     268 ?          S      Sep10     0:00    \_ /usr/sbin/chronyd -F -1 
 root         610    0.0    0.0     5612     112 tty1       Ss+    Sep10     0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux 
 root         611    0.0    0.0     7136     128 ttyS0      Ss+    Sep10     0:00 /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220 
 root         614    0.0    0.0    15852    2036 ?          Ss     Sep10     0:00 /usr/sbin/sshd -D 
 root        1043    0.0    0.0    16612    1088 ?          Ss     Sep10     0:00    \_ sshd: admin [priv] 
 admin       1062    0.0    0.0    16848    3000 ?          S      Sep10     0:06    |     \_ sshd: admin@notty 
 root       17232    0.0    0.2    16616    7872 ?          Ss     17:27     0:00    \_ sshd: admin [priv] 
 admin      17238    0.0    0.1    16616    4692 ?          S      17:27     0:00        \_ sshd: admin@pts/0 
 admin      17239    0.0    0.1     9644    4672 pts/0      Ss     17:27     0:00            \_ -bash 
 root       17252    0.0    0.0    11660    3720 pts/0      S      17:27     0:00                \_ sudo -s 
 root       17253    0.0    0.0     8716    3668 pts/0      S      17:27     0:00                    \_ /bin/bash 
 root       19266    0.0    0.0    12512    3164 pts/0      R+     17:41     0:00                        \_ ps auxwwwwf 
 admin       1046    0.0    0.1    21264    4464 ?          Ss     Sep10     0:00 /lib/systemd/systemd --user 
 admin       1047    0.0    0.0 104880    2312 ?          S      Sep10     0:00    \_ (sd-pam) 
 root        1237    0.0    1.4 1074276 54512 ?         Ssl    Sep10     1:26 /usr/sbin/dockerd -H fd:// --default-ulimit nofile=10000:10000 --dns 10.255.0.2 
 root        1244    0.1    0.5 952160 20456 ?          Ssl    Sep10     2:18    \_ docker-containerd --config /var/run/docker/containerd/containerd.toml --log-level info 
 root        1477    0.0    0.4    33716 16772 ?          Ss     Sep10     0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner 
 root        4570    0.0 10.4 1608588 396860 ?        Sl     Sep10     0:04 /usr/share/python2.7/dist/python-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.lugli-dz642-lrg004e66zorxld.326060118/keep689459133 
 </pre> 
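Note that the leftover arv-mount's command line embeds the container UUID in its crunch-run temp directory, so a stale mount can be matched back to its (cancelled) container. A minimal sketch using the ps line above (the UUID pattern is an assumption based on the Arvados UUID format seen in these logs):

```python
import re

# The arv-mount command line from the ps dump above (truncated to the mount path).
cmdline = ("/usr/bin/arv-mount --foreground --allow-other --read-write "
           "--crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh "
           "by_id /tmp/crunch-run.lugli-dz642-lrg004e66zorxld.326060118/keep689459133")

# Assumed pattern: /tmp/crunch-run.<container-uuid>.<random>/ where the UUID
# is <cluster>-dz642-<15 alphanumerics>, as in the log lines above.
m = re.search(r"/tmp/crunch-run\.([a-z0-9]+-dz642-[a-z0-9]{15})\.", cmdline)
print(m.group(1))  # → lugli-dz642-lrg004e66zorxld
```

That UUID matches one of the two containers the dispatcher keeps trying to kill, consistent with crunch-run having exited without tearing down its arv-mount child.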
