Bug #18686

Updated by Ward Vandewege about 2 years ago

This is on Tordo (AWS). A compute node has been running for over 2 days: 

 <pre> 
 root@ip-10-253-254-49:/home/admin# uptime  
  13:05:39 up 2 days, 12 min,    1 user,    load average: 0.02, 0.03, 0.00 
 </pre> 

 The AWS ID is i-06cce6b3e0448b10d at 10.253.254.49. 

 The a-d-c (arvados-dispatch-cloud) logs say: 

 <pre> 
 tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d 
 Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"} 
 Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"} 
 Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"} 
 Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"} 
 </pre> 

 On the node itself, @crunch-run@ and @arv-mount@ are running, but nothing else: 

 <pre> 
 root       18119    0.0    0.1    12512    3212 pts/0      R+     13:07     0:00                        \_ ps auxwf 
 admin        631    0.0    0.4    21140    8976 ?          Ss     Jan25     0:00 /lib/systemd/systemd --user 
 admin        632    0.0    0.1 104852    2364 ?          S      Jan25     0:00    \_ (sd-pam) 
 root        1148    0.1    2.5 1333156 50724 ?         Sl     Jan25     5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk 
 root        1161    0.1    2.4 1311492 48800 ?         Sl     Jan25     4:36    \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272 
 root        1307    0.0    4.3 865924 87296 ?          Ssl    Jan25     0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2 
 root        2104    0.0    1.1    32536 22200 ?          Ss     Jan25     0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner 
 </pre> 

 <pre> 
 root@ip-10-253-254-49:/tmp# docker ps -a 
 CONTAINER ID     IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES 
 </pre> 

 <pre> 
 root@ip-10-253-254-49:/tmp# v 
 total 4 
 drwxrwxrwt    4 root root     48 Jan 27 08:25 ./ 
 drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../ 
 drwx--x--x 14 root root    182 Jan 25 12:53 docker-data/ 
 drwxr-xr-x    2 root root     18 Jan 25 12:54 hsperfdata_root/ 
 </pre> 

 The container (tordo-dz642-2l2xwswnvbvc8gk) is just one part of our standard test suite:  

 <pre> 
 { 
   "uuid": "tordo-dz642-2l2xwswnvbvc8gk", 
   "owner_uuid": "tordo-tpzed-000000000000000", 
   "created_at": "2022-01-25T12:53:29.978Z", 
   "modified_at": "2022-01-27T12:58:31.556Z", 
   "modified_by_client_uuid": "tordo-ozdt8-q6dzdi1lcc03155", 
   "modified_by_user_uuid": "tordo-tpzed-000000000000000", 
   "state": "Locked", 
   "started_at": null, 
   "finished_at": null, 
   "log": "58d5e0dd2f63cc85d9146130dc10c54c+13282", 
   "environment": { 
     "HOME": "/var/spool/cwl", 
     "TMPDIR": "/tmp" 
   }, 
   "cwd": "/var/spool/cwl", 
   "command": [ 
     "/bin/sh", 
     "-c", 
     "echo \"HOME=$HOME\" \"TMPDIR=$TMPDIR\" && test \"$HOME\" = /var/spool/cwl -a \"$TMPDIR\" = /tmp" 
   ], 
   "output_path": "/var/spool/cwl", 
   "mounts": { 
     "/tmp": { 
       "capacity": 1073741824, 
       "kind": "tmp" 
     }, 
     "/var/spool/cwl": { 
       "capacity": 1073741824, 
       "kind": "tmp" 
     } 
   }, 
   "runtime_constraints": { 
     "API": false, 
     "cuda": { 
       "device_count": 0, 
       "driver_version": "", 
       "hardware_capability": "" 
     }, 
     "keep_cache_ram": 268435456, 
     "ram": 268435456, 
     "vcpus": 1 
   }, 
   "output": null, 
   "container_image": "021e994505b006982494a7caf0cedd1d+261", 
   "progress": null, 
   "priority": 562948310306102004, 
   "updated_at": null, 
   "exit_code": null, 
   "auth_uuid": "tordo-gj3su-rxkhzwtfimcmuij", 
   "locked_by_uuid": "tordo-gj3su-000000000000000", 
   "scheduling_parameters": { 
     "max_run_time": 0, 
     "partitions": [ 

     ], 
     "preemptible": false 
   }, 
   "runtime_status": { 
   }, 
   "runtime_user_uuid": "ce8i5-tpzed-yzrv3k3xiq86td0", 
   "runtime_auth_scopes": [ 
     "all" 
   ], 
   "runtime_token": null, 
   "lock_count": 1, 
   "gateway_address": null, 
   "interactive_session_started": false, 
   "output_storage_classes": [ 
     "default" 
   ] 
 } 
 </pre> 

 Versions: 

 <pre> 
 root@ip-10-253-254-49:/usr/src# dpkg -l |grep arv 
 ii    arvados-docker-cleaner                2.3.0~dev20210729201354-1      amd64          Arvados Docker cleaner 
 ii    python3-arvados-fuse                  2.4.0~dev20211126162134-1      amd64          Arvados FUSE driver 
 root@ip-10-253-254-49:/usr/src# dpkg -l |grep cru 
 ii    crunch-run                            2.4.0~dev20211230190012-1      amd64          Supervise a single Crunch container 
 </pre> 

 Clearly we need to rebuild the compute image to pick up the latest arvados-docker-cleaner version, but the fuse driver is up to date. 

 The @arv-mount@ parent process and two of its (33!) child threads are stuck in FUTEX_WAIT: 

 <pre> 
 # strace -f -p 1161 
 strace: Process 1161 attached with 34 threads 
 [pid    1275] read(5,    <unfinished ...> 
 [pid    1273] read(5,    <unfinished ...> 
 [pid    1272] read(5,    <unfinished ...> 
 [pid    1274] read(5,    <unfinished ...> 
 [pid    1271] read(5,    <unfinished ...> 
 [pid    1270] read(5,    <unfinished ...> 
 [pid    1269] read(5,    <unfinished ...> 
 [pid    1268] read(5,    <unfinished ...> 
 [pid    1263] read(5,    <unfinished ...> 
 [pid    1267] read(5,    <unfinished ...> 
 [pid    1266] read(5,    <unfinished ...> 
 [pid    1265] read(5,    <unfinished ...> 
 [pid    1264] read(5,    <unfinished ...> 
 [pid    1262] read(5,    <unfinished ...> 
 [pid    1261] read(5,    <unfinished ...> 
 [pid    1259] read(5,    <unfinished ...> 
 [pid    1260] read(5,    <unfinished ...> 
 [pid    1258] read(5,    <unfinished ...> 
 [pid    1257] read(5,    <unfinished ...> 
 [pid    1256] read(5,    <unfinished ...> 
 [pid    1250] read(5,    <unfinished ...> 
 [pid    1255] read(5,    <unfinished ...> 
 [pid    1254] read(5,    <unfinished ...> 
 [pid    1249] read(5,    <unfinished ...> 
 [pid    1248] read(5,    <unfinished ...> 
 [pid    1247] read(5,    <unfinished ...> 
 [pid    1245] read(5,    <unfinished ...> 
 [pid    1243] read(5,    <unfinished ...> 
 [pid    1244] read(5,    <unfinished ...> 
 [pid    1242] read(5,    <unfinished ...> 
 [pid    1241] futex(0x7ff9080013f0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...> 
 [pid    1239] futex(0x7ff911b05170, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...> 
 [pid    1232] select(0, NULL, NULL, NULL, {tv_sec=9, tv_usec=562958} <unfinished ...> 
 [pid    1161] futex(0x7ff904000d50, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY 
 </pre> 
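
 For what it's worth, those futex waits are just what blocked CPython lock acquisitions (e.g. @threading.Thread.join()@) look like from the kernel, and fd 5 in the @read(5, ...)@ calls is presumably the @/dev/fuse@ descriptor the llfuse worker threads are serving. A minimal sketch (hypothetical, not arvados code) that produces the same strace signature in the main thread: 

 <pre> 
 # Hypothetical sketch, not arvados code: a thread parked in Thread.join() 
 # (which bottoms out in lock.acquire()) shows up in strace as 
 #   futex(..., FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) 
 # much like pids 1161/1239/1241 above (exact flags vary with glibc/CPython). 
 import threading 
 import time 

 def worker(): 
     # stand-in for the llfuse main loop: never returns 
     time.sleep(10**9) 

 t = threading.Thread(target=worker, daemon=True) 
 t.start() 
 t.join(timeout=None)   # main thread blocks on a futex until the worker exits 
 </pre> 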

 That's only the C-level view, not so useful. To get the python backtrace, after some fiddling: 

 <pre> 
 # apt install gdb python3-dbg 
 # cd /usr/share/python3/dist/python3-arvados-fuse 
 # . bin/activate 
 </pre> 

 <pre> 
 # gdb python 1161 
 GNU gdb (Debian 8.2.1-2+b3) 8.2.1 
 Copyright (C) 2018 Free Software Foundation, Inc. 
 License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> 
 This is free software: you are free to change and redistribute it. 
 There is NO WARRANTY, to the extent permitted by law. 
 Type "show copying" and "show warranty" for details. 
 This GDB was configured as "x86_64-linux-gnu". 
 Type "show configuration" for configuration details. 
 For bug reporting instructions, please see: 
 <http://www.gnu.org/software/gdb/bugs/>. 
 Find the GDB manual and other documentation resources online at: 
     <http://www.gnu.org/software/gdb/documentation/>. 

 For help, type "help". 
 Type "apropos word" to search for commands related to "word"... 
 Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/99/21c75e6930d3e9d9fa8c942aca9dc4500bb65f.debug...done. 
 done. 
 Attaching to program: /usr/share/python3/dist/python3-arvados-fuse/bin/python, process 1161 
 [New LWP 1232] 
 [New LWP 1239] 
 [New LWP 1241] 
 [New LWP 1242] 
 [New LWP 1243] 
 [New LWP 1244] 
 [New LWP 1245] 
 [New LWP 1247] 
 [New LWP 1248] 
 [New LWP 1249] 
 [New LWP 1250] 
 [New LWP 1254] 
 [New LWP 1255] 
 [New LWP 1256] 
 [New LWP 1257] 
 [New LWP 1258] 
 [New LWP 1259] 
 [New LWP 1260] 
 [New LWP 1261] 
 [New LWP 1262] 
 [New LWP 1263] 
 [New LWP 1264] 
 [New LWP 1265] 
 [New LWP 1266] 
 [New LWP 1267] 
 [New LWP 1268] 
 [New LWP 1269] 
 [New LWP 1270] 
 [New LWP 1271] 
 [New LWP 1272] 
 [New LWP 1273] 
 [New LWP 1274] 
 [New LWP 1275] 
 [Thread debugging using libthread_db enabled] 
 Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 
 futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7ff904000d50) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 
 205 	 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory. 
 (gdb) source /usr/share/gdb/auto-load/usr/bin/python3.7-gdb.py 
 </pre> 

 And then, with python's gdb integration loaded: 

 <pre> 
 (gdb) py-bt 
 Traceback (most recent call first): 
   File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock 
     elif lock.acquire(block, timeout): 
   File "/usr/lib/python3.7/threading.py", line 1032, in join 
     self._wait_for_tstate_lock() 
   File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 387, in _run_standalone 
     self.llfuse_thread.join(timeout=None) 
   File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 176, in run 
     self._run_standalone() 
   File "/usr/bin/arv-mount", line 10, in <module> 
     arvados_fuse.command.Mount(args).run() 
 </pre>
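
 So the main thread is simply sitting in @Thread.join()@ waiting for the llfuse thread to exit, which only happens when the filesystem is unmounted. For context, a simplified sketch of the pattern those frames point at (not the actual @arvados_fuse/command.py@ code, just its shape): 

 <pre> 
 # Simplified sketch, not the actual arvados_fuse code: _run_standalone() 
 # parks the main thread in join() until the llfuse worker thread (the FUSE 
 # request loop) exits, i.e. until the mount is torn down. 
 import threading 

 class Mount: 
     def _llfuse_main(self): 
         # serve FUSE requests until unmount; the 30 threads blocked in 
         # read(5, ...) in the strace output are presumably workers in here 
         ... 

     def _run_standalone(self): 
         self.llfuse_thread = threading.Thread(target=self._llfuse_main) 
         self.llfuse_thread.start() 
         self.llfuse_thread.join(timeout=None)   # <- the frame in the py-bt above 
 </pre> 

 In other words, judging by this backtrace the @arv-mount@ main thread appears to be waiting for the mount to be released rather than deadlocked on anything internal. 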
