Bug #18686 (closed): [a-d-c] compute node stuck

Added by Ward Vandewege about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

This is on Tordo (AWS). A compute node has been running for over 2 days:

root@ip-10-253-254-49:/home/admin# uptime 
 13:05:39 up 2 days, 12 min,  1 user,  load average: 0.02, 0.03, 0.00

The AWS instance ID is i-06cce6b3e0448b10d, at 10.253.254.49.
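
The launch time reported by EC2 should confirm how long the node has really been up (a quick sketch; assumes the AWS CLI is configured with credentials for this account and region):

# prints LaunchTime for the stuck instance; assumes the aws CLI is set up
aws ec2 describe-instances --instance-ids i-06cce6b3e0448b10d \
    --query 'Reservations[].Instances[].LaunchTime' --output text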

The a-d-c logs say:

tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d
Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}
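
For a quicker read of the dispatcher's view of this instance, the JSON log lines can be boiled down to timestamp, state, and message (a sketch, assuming jq is available; -o cat strips the syslog prefix so each line is plain JSON):

# assumes jq is installed; .State is absent on some lines, hence the fallback
journalctl -u arvados-dispatch-cloud.service -o cat \
    | grep i-06cce6b3e0448b10d \
    | jq -r '[.time, .State // "-", .msg] | @tsv'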

On the node itself, crunch-run and arv-mount are running, but nothing more:

root     18119  0.0  0.1  12512  3212 pts/0    R+   13:07   0:00                      \_ ps auxwf
admin      631  0.0  0.4  21140  8976 ?        Ss   Jan25   0:00 /lib/systemd/systemd --user
admin      632  0.0  0.1 104852  2364 ?        S    Jan25   0:00  \_ (sd-pam)
root      1148  0.1  2.5 1333156 50724 ?       Sl   Jan25   5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk
root      1161  0.1  2.4 1311492 48800 ?       Sl   Jan25   4:36  \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272
root      1307  0.0  4.3 865924 87296 ?        Ssl  Jan25   0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2
root      2104  0.0  1.1  32536 22200 ?        Ss   Jan25   0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner
root@ip-10-253-254-49:/tmp# docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
root@ip-10-253-254-49:/tmp# v
total 4
drwxrwxrwt  4 root root   48 Jan 27 08:25 ./
drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../
drwx--x--x 14 root root  182 Jan 25 12:53 docker-data/
drwxr-xr-x  2 root root   18 Jan 25 12:54 hsperfdata_root/
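
Note that crunch-run was started with --runtime-engine=singularity, so the empty docker ps -a output is presumably expected; whether a singularity process ever started can be checked directly (standard procps, nothing Arvados-specific):

# prints matching processes with their full command lines, or a note if none
pgrep -af singularity || echo "no singularity processes"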

The container (tordo-dz642-2l2xwswnvbvc8gk) is just one part of our standard test suite:

{
  "uuid": "tordo-dz642-2l2xwswnvbvc8gk",
  "owner_uuid": "tordo-tpzed-000000000000000",
  "created_at": "2022-01-25T12:53:29.978Z",
  "modified_at": "2022-01-27T12:58:31.556Z",
  "modified_by_client_uuid": "tordo-ozdt8-q6dzdi1lcc03155",
  "modified_by_user_uuid": "tordo-tpzed-000000000000000",
  "state": "Locked",
  "started_at": null,
  "finished_at": null,
  "log": "58d5e0dd2f63cc85d9146130dc10c54c+13282",
  "environment": {
    "HOME": "/var/spool/cwl",
    "TMPDIR": "/tmp" 
  },
  "cwd": "/var/spool/cwl",
  "command": [
    "/bin/sh",
    "-c",
    "echo \"HOME=$HOME\" \"TMPDIR=$TMPDIR\" && test \"$HOME\" = /var/spool/cwl -a \"$TMPDIR\" = /tmp" 
  ],
  "output_path": "/var/spool/cwl",
  "mounts": {
    "/tmp": {
      "capacity": 1073741824,
      "kind": "tmp" 
    },
    "/var/spool/cwl": {
      "capacity": 1073741824,
      "kind": "tmp" 
    }
  },
  "runtime_constraints": {
    "API": false,
    "cuda": {
      "device_count": 0,
      "driver_version": "",
      "hardware_capability": "" 
    },
    "keep_cache_ram": 268435456,
    "ram": 268435456,
    "vcpus": 1
  },
  "output": null,
  "container_image": "021e994505b006982494a7caf0cedd1d+261",
  "progress": null,
  "priority": 562948310306102004,
  "updated_at": null,
  "exit_code": null,
  "auth_uuid": "tordo-gj3su-rxkhzwtfimcmuij",
  "locked_by_uuid": "tordo-gj3su-000000000000000",
  "scheduling_parameters": {
    "max_run_time": 0,
    "partitions": [

    ],
    "preemptible": false
  },
  "runtime_status": {
  },
  "runtime_user_uuid": "ce8i5-tpzed-yzrv3k3xiq86td0",
  "runtime_auth_scopes": [
    "all" 
  ],
  "runtime_token": null,
  "lock_count": 1,
  "gateway_address": null,
  "interactive_session_started": false,
  "output_storage_classes": [
    "default" 
  ]
}
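
For reference, the container record above can be fetched straight from the API (a hedged sketch; assumes ARVADOS_API_HOST and ARVADOS_API_TOKEN are set in the environment for a user allowed to read the record):

# GET the container record as JSON; /arvados/v1/containers/{uuid} is the REST path
curl -s -H "Authorization: Bearer $ARVADOS_API_TOKEN" \
    "https://$ARVADOS_API_HOST/arvados/v1/containers/tordo-dz642-2l2xwswnvbvc8gk" | jq .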

Versions:

root@ip-10-253-254-49:/usr/src# dpkg -l |grep arv
ii  arvados-docker-cleaner              2.3.0~dev20210729201354-1    amd64        Arvados Docker cleaner
ii  python3-arvados-fuse                2.4.0~dev20211126162134-1    amd64        Arvados FUSE driver
root@ip-10-253-254-49:/usr/src# dpkg -l |grep cru
ii  crunch-run                          2.4.0~dev20211230190012-1    amd64        Supervise a single Crunch container

Clearly we need to rebuild the image to get the latest arvados-docker-cleaner version, but the FUSE driver is up to date.
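
The installed-vs-available gap can be double-checked from the node itself (a minimal sketch; assumes the Arvados package repository is configured in the image's apt sources):

# compares each installed version against the repository candidate
apt-get update -qq
apt-cache policy arvados-docker-cleaner python3-arvados-fuse crunch-run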

The arv-mount main thread and two of its (33!) child threads are stuck in FUTEX_WAIT:

# strace -f -p 1161
strace: Process 1161 attached with 34 threads
[pid  1275] read(5,  <unfinished ...>
[pid  1273] read(5,  <unfinished ...>
[pid  1272] read(5,  <unfinished ...>
[pid  1274] read(5,  <unfinished ...>
[pid  1271] read(5,  <unfinished ...>
[pid  1270] read(5,  <unfinished ...>
[pid  1269] read(5,  <unfinished ...>
[pid  1268] read(5,  <unfinished ...>
[pid  1263] read(5,  <unfinished ...>
[pid  1267] read(5,  <unfinished ...>
[pid  1266] read(5,  <unfinished ...>
[pid  1265] read(5,  <unfinished ...>
[pid  1264] read(5,  <unfinished ...>
[pid  1262] read(5,  <unfinished ...>
[pid  1261] read(5,  <unfinished ...>
[pid  1259] read(5,  <unfinished ...>
[pid  1260] read(5,  <unfinished ...>
[pid  1258] read(5,  <unfinished ...>
[pid  1257] read(5,  <unfinished ...>
[pid  1256] read(5,  <unfinished ...>
[pid  1250] read(5,  <unfinished ...>
[pid  1255] read(5,  <unfinished ...>
[pid  1254] read(5,  <unfinished ...>
[pid  1249] read(5,  <unfinished ...>
[pid  1248] read(5,  <unfinished ...>
[pid  1247] read(5,  <unfinished ...>
[pid  1245] read(5,  <unfinished ...>
[pid  1243] read(5,  <unfinished ...>
[pid  1244] read(5,  <unfinished ...>
[pid  1242] read(5,  <unfinished ...>
[pid  1241] futex(0x7ff9080013f0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid  1239] futex(0x7ff911b05170, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid  1232] select(0, NULL, NULL, NULL, {tv_sec=9, tv_usec=562958} <unfinished ...>
[pid  1161] futex(0x7ff904000d50, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

That's the C-level view (system calls), not so useful.
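
Most of the worker threads are blocked in read(5, ...); what fd 5 points at is easy to confirm (a sketch; if these are llfuse worker threads, it should resolve to /dev/fuse):

# assumes pid 1161 is still the arv-mount main process
ls -l /proc/1161/fd/5

To get the Python backtrace: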

# apt install gdb python3-dbg
# cd /usr/share/python3/dist/python3-arvados-fuse
# . bin/activate
# gdb python 1161
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/99/21c75e6930d3e9d9fa8c942aca9dc4500bb65f.debug...done.
done.
Attaching to program: /usr/share/python3/dist/python3-arvados-fuse/bin/python, process 1161
[New LWP 1232]
[New LWP 1239]
[New LWP 1241]
[New LWP 1242]
[New LWP 1243]
[New LWP 1244]
[New LWP 1245]
[New LWP 1247]
[New LWP 1248]
[New LWP 1249]
[New LWP 1250]
[New LWP 1254]
[New LWP 1255]
[New LWP 1256]
[New LWP 1257]
[New LWP 1258]
[New LWP 1259]
[New LWP 1260]
[New LWP 1261]
[New LWP 1262]
[New LWP 1263]
[New LWP 1264]
[New LWP 1265]
[New LWP 1266]
[New LWP 1267]
[New LWP 1268]
[New LWP 1269]
[New LWP 1270]
[New LWP 1271]
[New LWP 1272]
[New LWP 1273]
[New LWP 1274]
[New LWP 1275]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7ff904000d50) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205    ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) source /usr/share/gdb/auto-load/usr/bin/python3.7-gdb.py

And then:

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/usr/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 387, in _run_standalone
    self.llfuse_thread.join(timeout=None)
  File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 176, in run
    self._run_standalone()
  File "/usr/bin/arv-mount", line 10, in <module>
    arvados_fuse.command.Mount(args).run()

That backtrace does not seem super helpful, though.
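
One caveat: py-bt only prints the Python stack of the current thread, which here is the main thread parked in join(). Dumping every thread would show what the llfuse workers are actually doing (thread apply all is standard gdb; py-bt is provided by the python3.7-gdb.py script sourced above):

(gdb) thread apply all py-bt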


Files

tordo-dz642-2l2xwswnvbvc8gk.logs.bz2 (2.32 MB): live logs from the database for tordo-dz642-2l2xwswnvbvc8gk. Ward Vandewege, 01/27/2022 10:06 PM