Bug #18686

Updated by Ward Vandewege about 2 years ago

This is on Tordo (AWS). A compute node has been running for over 2 days: 

 <pre> 
 root@ip-10-253-254-49:/home/admin# uptime  
  13:05:39 up 2 days, 12 min,    1 user,    load average: 0.02, 0.03, 0.00 
 </pre> 

 The AWS ID is i-06cce6b3e0448b10d. 

 The @arvados-dispatch-cloud@ (a-d-c) logs say: 

 <pre> 
 tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d 
 Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"} 
 Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"} 
 Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"} 
 </pre> 
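
 For reference, the same timeline can be pulled out of the journal with a short script instead of @grep@. This is only a sketch: it assumes each journal line carries one JSON object after the syslog prefix, as in the output above, and the helper names @parse_adc_line@ and @instance_timeline@ are made up for illustration (they are not part of Arvados). The two sample lines are copied from the log excerpt above.

```python
import json

# Two lines copied verbatim from the journalctl output above.
SAMPLE_LINES = [
    'Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}',
    'Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}',
]


def parse_adc_line(line):
    """Return the JSON payload of one journal line as a dict, or None.

    Assumption: the payload starts at the first '{' on the line.
    """
    start = line.find("{")
    if start < 0:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def instance_timeline(lines, instance_id):
    """Collect (time, msg, ContainerUUID) tuples for one instance, in input order."""
    events = []
    for line in lines:
        rec = parse_adc_line(line)
        if rec and rec.get("Instance") == instance_id:
            events.append((rec.get("time"), rec.get("msg"), rec.get("ContainerUUID")))
    return events


if __name__ == "__main__":
    for when, msg, container in instance_timeline(SAMPLE_LINES, "i-06cce6b3e0448b10d"):
        print(when, msg, container or "")
```

 In practice the input would be the lines from @journalctl -u arvados-dispatch-cloud.service@ rather than the hard-coded sample. The last event for this instance is "crunch-run process started", which matches what is seen on the node.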

 On the node itself, @crunch-run@ and @arv-mount@ are running, but nothing else related to the container: 

 <pre> 
 root       18119    0.0    0.1    12512    3212 pts/0      R+     13:07     0:00                        \_ ps auxwf 
 admin        631    0.0    0.4    21140    8976 ?          Ss     Jan25     0:00 /lib/systemd/systemd --user 
 admin        632    0.0    0.1 104852    2364 ?          S      Jan25     0:00    \_ (sd-pam) 
 root        1148    0.1    2.5 1333156 50724 ?         Sl     Jan25     5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk 
 root        1161    0.1    2.4 1311492 48800 ?         Sl     Jan25     4:36    \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272 
 root        1307    0.0    4.3 865924 87296 ?          Ssl    Jan25     0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2 
 root        2104    0.0    1.1    32536 22200 ?          Ss     Jan25     0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner 
 </pre> 

 <pre> 
 root@ip-10-253-254-49:/tmp# docker ps -a 
 CONTAINER ID     IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES 
 </pre> 

 <pre> 
 root@ip-10-253-254-49:/tmp# v 
 total 4 
 drwxrwxrwt    4 root root     48 Jan 27 08:25 ./ 
 drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../ 
 drwx--x--x 14 root root    182 Jan 25 12:53 docker-data/ 
 drwxr-xr-x    2 root root     18 Jan 25 12:54 hsperfdata_root/ 
 </pre>
