Bug #18686
[a-d-c] compute node stuck (closed)
Description
This is on Tordo (AWS). A compute node has been running for over 2 days:
root@ip-10-253-254-49:/home/admin# uptime
 13:05:39 up 2 days, 12 min,  1 user,  load average: 0.02, 0.03, 0.00
The AWS ID is i-06cce6b3e0448b10d at 10.253.254.49.
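For reference, the instance can also be checked from the cloud side; a minimal sketch, assuming AWS CLI credentials for the account Tordo runs in:

$ aws ec2 describe-instances --instance-ids i-06cce6b3e0448b10d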
The a-d-c logs say:
tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d
Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}
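For readability, the timestamps and messages can be pulled out of those JSON payloads; a sketch, assuming jq is installed on the dispatch host:

tordo:~# journalctl -u arvados-dispatch-cloud.service -o cat | grep i-06cce6b3e0448b10d | jq -r '"\(.time) \(.msg)"'

So the dispatcher saw a normal boot sequence and handed the container to the node at 12:53:39.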
On the node itself, crunch-run and arv-mount are running, but nothing more:
root     18119  0.0  0.1   12512  3212 pts/0   R+   13:07   0:00  \_ ps auxwf
admin      631  0.0  0.4   21140  8976 ?       Ss   Jan25   0:00 /lib/systemd/systemd --user
admin      632  0.0  0.1  104852  2364 ?       S    Jan25   0:00  \_ (sd-pam)
root      1148  0.1  2.5 1333156 50724 ?       Sl   Jan25   5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk
root      1161  0.1  2.4 1311492 48800 ?       Sl   Jan25   4:36  \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272
root      1307  0.0  4.3  865924 87296 ?       Ssl  Jan25   0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2
root      2104  0.0  1.1   32536 22200 ?       Ss   Jan25   0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner
root@ip-10-253-254-49:/tmp# docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
root@ip-10-253-254-49:/tmp# v
total 4
drwxrwxrwt  4 root root   48 Jan 27 08:25 ./
drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../
drwx--x--x 14 root root  182 Jan 25 12:53 docker-data/
drwxr-xr-x  2 root root   18 Jan 25 12:54 hsperfdata_root/
(An empty docker ps is expected here, since crunch-run was started with --runtime-engine=singularity.)
The container (tordo-dz642-2l2xwswnvbvc8gk) is just one part of our standard test suite:
{ "uuid": "tordo-dz642-2l2xwswnvbvc8gk", "owner_uuid": "tordo-tpzed-000000000000000", "created_at": "2022-01-25T12:53:29.978Z", "modified_at": "2022-01-27T12:58:31.556Z", "modified_by_client_uuid": "tordo-ozdt8-q6dzdi1lcc03155", "modified_by_user_uuid": "tordo-tpzed-000000000000000", "state": "Locked", "started_at": null, "finished_at": null, "log": "58d5e0dd2f63cc85d9146130dc10c54c+13282", "environment": { "HOME": "/var/spool/cwl", "TMPDIR": "/tmp" }, "cwd": "/var/spool/cwl", "command": [ "/bin/sh", "-c", "echo \"HOME=$HOME\" \"TMPDIR=$TMPDIR\" && test \"$HOME\" = /var/spool/cwl -a \"$TMPDIR\" = /tmp" ], "output_path": "/var/spool/cwl", "mounts": { "/tmp": { "capacity": 1073741824, "kind": "tmp" }, "/var/spool/cwl": { "capacity": 1073741824, "kind": "tmp" } }, "runtime_constraints": { "API": false, "cuda": { "device_count": 0, "driver_version": "", "hardware_capability": "" }, "keep_cache_ram": 268435456, "ram": 268435456, "vcpus": 1 }, "output": null, "container_image": "021e994505b006982494a7caf0cedd1d+261", "progress": null, "priority": 562948310306102004, "updated_at": null, "exit_code": null, "auth_uuid": "tordo-gj3su-rxkhzwtfimcmuij", "locked_by_uuid": "tordo-gj3su-000000000000000", "scheduling_parameters": { "max_run_time": 0, "partitions": [ ], "preemptible": false }, "runtime_status": { }, "runtime_user_uuid": "ce8i5-tpzed-yzrv3k3xiq86td0", "runtime_auth_scopes": [ "all" ], "runtime_token": null, "lock_count": 1, "gateway_address": null, "interactive_session_started": false, "output_storage_classes": [ "default" ] }
Versions:
root@ip-10-253-254-49:/usr/src# dpkg -l |grep arv
ii  arvados-docker-cleaner  2.3.0~dev20210729201354-1  amd64  Arvados Docker cleaner
ii  python3-arvados-fuse    2.4.0~dev20211126162134-1  amd64  Arvados FUSE driver
root@ip-10-253-254-49:/usr/src# dpkg -l |grep cru
ii  crunch-run              2.4.0~dev20211230190012-1  amd64  Supervise a single Crunch container
Clearly we need to rebuild the image to get the latest arvados-docker-cleaner version, but the FUSE driver is up to date.
The arv-mount parent process and two of the (33!) children are stuck on FUTEX_WAIT:
# strace -f -p 1161
strace: Process 1161 attached with 34 threads
[pid  1275] read(5,  <unfinished ...>
[pid  1273] read(5,  <unfinished ...>
[pid  1272] read(5,  <unfinished ...>
[pid  1274] read(5,  <unfinished ...>
[pid  1271] read(5,  <unfinished ...>
[pid  1270] read(5,  <unfinished ...>
[pid  1269] read(5,  <unfinished ...>
[pid  1268] read(5,  <unfinished ...>
[pid  1263] read(5,  <unfinished ...>
[pid  1267] read(5,  <unfinished ...>
[pid  1266] read(5,  <unfinished ...>
[pid  1265] read(5,  <unfinished ...>
[pid  1264] read(5,  <unfinished ...>
[pid  1262] read(5,  <unfinished ...>
[pid  1261] read(5,  <unfinished ...>
[pid  1259] read(5,  <unfinished ...>
[pid  1260] read(5,  <unfinished ...>
[pid  1258] read(5,  <unfinished ...>
[pid  1257] read(5,  <unfinished ...>
[pid  1256] read(5,  <unfinished ...>
[pid  1250] read(5,  <unfinished ...>
[pid  1255] read(5,  <unfinished ...>
[pid  1254] read(5,  <unfinished ...>
[pid  1249] read(5,  <unfinished ...>
[pid  1248] read(5,  <unfinished ...>
[pid  1247] read(5,  <unfinished ...>
[pid  1245] read(5,  <unfinished ...>
[pid  1243] read(5,  <unfinished ...>
[pid  1244] read(5,  <unfinished ...>
[pid  1242] read(5,  <unfinished ...>
[pid  1241] futex(0x7ff9080013f0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid  1239] futex(0x7ff911b05170, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid  1232] select(0, NULL, NULL, NULL, {tv_sec=9, tv_usec=562958} <unfinished ...>
[pid  1161] futex(0x7ff904000d50, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
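This FUTEX_WAIT pattern is just what blocked CPython locks look like from the kernel side, so by itself it does not say which Python-level lock is involved. A minimal sketch that reproduces the same signature on any node with python3 and strace:

# python3 -c 'import threading; threading.Event().wait()' &
# sleep 1; timeout 5 strace -p $!

which should show the process parked in the same futex(..., FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, ...) call (kill %1 to clean up).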
That only shows where each thread is blocked at the C/syscall level, which is not so useful here. To get the Python backtrace:
# apt install gdb python3-dbg
# cd /usr/share/python3/dist/python3-arvados-fuse
# . bin/activate
# gdb python 1161
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/99/21c75e6930d3e9d9fa8c942aca9dc4500bb65f.debug...done.
done.
Attaching to program: /usr/share/python3/dist/python3-arvados-fuse/bin/python, process 1161
[New LWP 1232]
[New LWP 1239]
[New LWP 1241]
[New LWP 1242]
[New LWP 1243]
[New LWP 1244]
[New LWP 1245]
[New LWP 1247]
[New LWP 1248]
[New LWP 1249]
[New LWP 1250]
[New LWP 1254]
[New LWP 1255]
[New LWP 1256]
[New LWP 1257]
[New LWP 1258]
[New LWP 1259]
[New LWP 1260]
[New LWP 1261]
[New LWP 1262]
[New LWP 1263]
[New LWP 1264]
[New LWP 1265]
[New LWP 1266]
[New LWP 1267]
[New LWP 1268]
[New LWP 1269]
[New LWP 1270]
[New LWP 1271]
[New LWP 1272]
[New LWP 1273]
[New LWP 1274]
[New LWP 1275]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7ff904000d50)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205     ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) source /usr/share/gdb/auto-load/usr/bin/python3.7-gdb.py
And then:
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/usr/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 387, in _run_standalone
    self.llfuse_thread.join(timeout=None)
  File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 176, in run
    self._run_standalone()
  File "/usr/bin/arv-mount", line 10, in <module>
    arvados_fuse.command.Mount(args).run()
That backtrace is not super helpful by itself, though: it only shows the main thread waiting for the llfuse thread to exit, which is what it does during normal operation too.
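Since the main thread is expected to sit in that join(), the other 33 threads are where the answer is likely to be; their stacks can be dumped in one pass from the same gdb session (assuming the python3.7-gdb.py helpers loaded cleanly):

(gdb) info threads
(gdb) thread apply all py-bt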