Bug #18686

Updated by Ward Vandewege about 2 years ago

This is on Tordo (AWS). A compute node has been running for over 2 days: 

 <pre> 
 root@ip-10-253-254-49:/home/admin# uptime  
  13:05:39 up 2 days, 12 min,    1 user,    load average: 0.02, 0.03, 0.00 
 </pre> 

 The AWS ID is i-06cce6b3e0448b10d at 10.253.254.49. 

 The arvados-dispatch-cloud (a-d-c) logs say: 

 <pre> 
 tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d 
 Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"} 
 Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"} 
 Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"} 
 </pre> 
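 The log lines above are a journald prefix followed by a JSON object, so they can be reduced to a short timeline per instance. The following is a minimal sketch, not part of Arvados: the helper names @parse_dispatch_line@ and @instance_timeline@ are hypothetical, and the sample lines are trimmed copies of three of the entries above.

```python
import json


def parse_dispatch_line(line):
    """Return the JSON payload of one journald line, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def instance_timeline(lines, instance_id):
    """Collect (time, state, msg) tuples for one instance, ordered by timestamp."""
    events = []
    for line in lines:
        rec = parse_dispatch_line(line)
        if rec and rec.get("Instance") == instance_id:
            events.append((rec.get("time", ""), rec.get("State", ""), rec.get("msg", "")))
    # The timestamps share one fixed-width UTC format, so string order is time order.
    return sorted(events)


# Trimmed copies of three log lines from this ticket (some fields omitted).
sample = [
    'Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Instance":"i-06cce6b3e0448b10d","State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}',
    'Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Instance":"i-06cce6b3e0448b10d","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"}',
    'Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}',
]

for when, state, msg in instance_timeline(sample, "i-06cce6b3e0448b10d"):
    print(when, state or "-", msg)
```

 Read this way, the last event recorded for the instance is "crunch-run process started"; nothing later appears in the excerpt.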

 On the node itself, @crunch-run@ and @arv-mount@ are running but nothing more: 

 <pre> 
 root       18119    0.0    0.1    12512    3212 pts/0      R+     13:07     0:00                        \_ ps auxwf 
 admin        631    0.0    0.4    21140    8976 ?          Ss     Jan25     0:00 /lib/systemd/systemd --user 
 admin        632    0.0    0.1 104852    2364 ?          S      Jan25     0:00    \_ (sd-pam) 
 root        1148    0.1    2.5 1333156 50724 ?         Sl     Jan25     5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk 
 root        1161    0.1    2.4 1311492 48800 ?         Sl     Jan25     4:36    \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272 
 root        1307    0.0    4.3 865924 87296 ?          Ssl    Jan25     0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2 
 root        2104    0.0    1.1    32536 22200 ?          Ss     Jan25     0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner 
 </pre> 

 <pre> 
 root@ip-10-253-254-49:/tmp# docker ps -a 
 CONTAINER ID     IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES 
 </pre> 

 <pre> 
 root@ip-10-253-254-49:/tmp# v 
 total 4 
 drwxrwxrwt    4 root root     48 Jan 27 08:25 ./ 
 drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../ 
 drwx--x--x 14 root root    182 Jan 25 12:53 docker-data/ 
 drwxr-xr-x    2 root root     18 Jan 25 12:54 hsperfdata_root/ 
 </pre> 

 The container (tordo-dz642-2l2xwswnvbvc8gk) is just one part of our standard test suite. Its record shows state @Locked@ with @started_at@ null: 

 <pre> 
 { 
   "uuid": "tordo-dz642-2l2xwswnvbvc8gk", 
   "owner_uuid": "tordo-tpzed-000000000000000", 
   "created_at": "2022-01-25T12:53:29.978Z", 
   "modified_at": "2022-01-27T12:58:31.556Z", 
   "modified_by_client_uuid": "tordo-ozdt8-q6dzdi1lcc03155", 
   "modified_by_user_uuid": "tordo-tpzed-000000000000000", 
   "state": "Locked", 
   "started_at": null, 
   "finished_at": null, 
   "log": "58d5e0dd2f63cc85d9146130dc10c54c+13282", 
   "environment": { 
     "HOME": "/var/spool/cwl", 
     "TMPDIR": "/tmp" 
   }, 
   "cwd": "/var/spool/cwl", 
   "command": [ 
     "/bin/sh", 
     "-c", 
     "echo \"HOME=$HOME\" \"TMPDIR=$TMPDIR\" && test \"$HOME\" = /var/spool/cwl -a \"$TMPDIR\" = /tmp" 
   ], 
   "output_path": "/var/spool/cwl", 
   "mounts": { 
     "/tmp": { 
       "capacity": 1073741824, 
       "kind": "tmp" 
     }, 
     "/var/spool/cwl": { 
       "capacity": 1073741824, 
       "kind": "tmp" 
     } 
   }, 
   "runtime_constraints": { 
     "API": false, 
     "cuda": { 
       "device_count": 0, 
       "driver_version": "", 
       "hardware_capability": "" 
     }, 
     "keep_cache_ram": 268435456, 
     "ram": 268435456, 
     "vcpus": 1 
   }, 
   "output": null, 
   "container_image": "021e994505b006982494a7caf0cedd1d+261", 
   "progress": null, 
   "priority": 562948310306102004, 
   "updated_at": null, 
   "exit_code": null, 
   "auth_uuid": "tordo-gj3su-rxkhzwtfimcmuij", 
   "locked_by_uuid": "tordo-gj3su-000000000000000", 
   "scheduling_parameters": { 
     "max_run_time": 0, 
     "partitions": [ 

     ], 
     "preemptible": false 
   }, 
   "runtime_status": { 
   }, 
   "runtime_user_uuid": "ce8i5-tpzed-yzrv3k3xiq86td0", 
   "runtime_auth_scopes": [ 
     "all" 
   ], 
   "runtime_token": null, 
   "lock_count": 1, 
   "gateway_address": null, 
   "interactive_session_started": false, 
   "output_storage_classes": [ 
     "default" 
   ] 
 } 
 </pre>
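 The symptom in this record can be checked mechanically: the container is @Locked@, has no @started_at@, and its timestamps span roughly two days. The following is a rough sketch under stated assumptions: the helper names @locked_without_start@ and @record_age@ are hypothetical, only four fields of the record are copied, and @modified_at@ is used as a stand-in for "last time the record was touched", not as proof of when the container got stuck.

```python
from datetime import datetime

# Timestamp layout used by created_at / modified_at in the record above.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def locked_without_start(record):
    """True when the container is Locked but was never marked as started."""
    return record["state"] == "Locked" and record["started_at"] is None


def record_age(record):
    """Time between created_at and the most recent modified_at."""
    created = datetime.strptime(record["created_at"], TIME_FORMAT)
    modified = datetime.strptime(record["modified_at"], TIME_FORMAT)
    return modified - created


# Subset of the container record shown in this ticket.
record = {
    "uuid": "tordo-dz642-2l2xwswnvbvc8gk",
    "created_at": "2022-01-25T12:53:29.978Z",
    "modified_at": "2022-01-27T12:58:31.556Z",
    "state": "Locked",
    "started_at": None,
}

if locked_without_start(record):
    print(record["uuid"], "Locked and not started; record spans", record_age(record))
```

 A check like this could be run over all containers in state @Locked@ to spot other instances of the same problem, assuming the records are fetched separately.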
