Bug #18686
Updated by Ward Vandewege about 3 years ago
This is on Tordo (AWS). A compute node has been running for over 2 days:

<pre>
root@ip-10-253-254-49:/home/admin# uptime
 13:05:39 up 2 days, 12 min, 1 user, load average: 0.02, 0.03, 0.00
</pre>

The AWS instance ID is i-06cce6b3e0448b10d, at 10.253.254.49.

The a-d-c logs say:

<pre>
tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d
Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}
</pre>

On the node itself, @crunch-run@ and @arv-mount@ are running but nothing more:

<pre>
root 18119 0.0 0.1 12512 3212 pts/0 R+ 13:07 0:00 \_ ps auxwf
admin 631 0.0 0.4 21140 8976 ? Ss Jan25 0:00 /lib/systemd/systemd --user
admin 632 0.0 0.1 104852 2364 ? S Jan25 0:00 \_ (sd-pam)
root 1148 0.1 2.5 1333156 50724 ? Sl Jan25 5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk
root 1161 0.1 2.4 1311492 48800 ? Sl Jan25 4:36 \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272
root 1307 0.0 4.3 865924 87296 ? Ssl Jan25 0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2
root 2104 0.0 1.1 32536 22200 ? Ss Jan25 0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner
</pre>

<pre>
root@ip-10-253-254-49:/tmp# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
</pre>

<pre>
root@ip-10-253-254-49:/tmp# v
total 4
drwxrwxrwt 4 root root 48 Jan 27 08:25 ./
drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../
drwx--x--x 14 root root 182 Jan 25 12:53 docker-data/
drwxr-xr-x 2 root root 18 Jan 25 12:54 hsperfdata_root/
</pre>
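
For anyone triaging a similar stuck node: each a-d-c journal entry quoted above is a syslog-style prefix followed by a JSON payload, so the per-instance timeline can be pulled out mechanically. The following is only a rough sketch, not part of Arvados. The helper names @parse_journal_line@ and @timeline@ are made up for illustration, and the two sample lines are abbreviated copies of entries from the log above (most fields dropped).

```python
import json

# Two entries from the log above, abbreviated to a few fields for illustration.
SAMPLE = [
    'Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: '
    '{"Instance":"i-06cce6b3e0448b10d","State":"booting","level":"info",'
    '"msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}',
    'Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: '
    '{"ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d",'
    '"level":"info","msg":"crunch-run process started",'
    '"time":"2022-01-25T12:53:39.471920534Z"}',
]


def parse_journal_line(line):
    """Return the JSON payload of one journal line as a dict, or None."""
    start = line.find("{")  # the payload begins at the first brace
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def timeline(lines, instance):
    """List (time, msg) pairs for one instance, in the order they were logged."""
    events = []
    for line in lines:
        record = parse_journal_line(line)
        if record is not None and record.get("Instance") == instance:
            events.append((record.get("time"), record.get("msg")))
    return events


if __name__ == "__main__":
    for when, msg in timeline(SAMPLE, "i-06cce6b3e0448b10d"):
        print(when, msg)
```

The last message in the timeline shows how far the dispatcher got with the instance. In this report it is "crunch-run process started", with nothing logged for the instance after that.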