Project

General

Profile

Actions

Bug #18686

closed

[a-d-c] compute node stuck

Added by Ward Vandewege 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

This is on Tordo (AWS). A compute node has been running for over 2 days:

root@ip-10-253-254-49:/home/admin# uptime 
 13:05:39 up 2 days, 12 min,  1 user,  load average: 0.02, 0.03, 0.00

The AWS ID is i-06cce6b3e0448b10d at 10.253.254.49.

The a-d-c logs say:

tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d
Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}

On the node itself, crunch-run and arv-mount are running but nothing more:

root     18119  0.0  0.1  12512  3212 pts/0    R+   13:07   0:00                      \_ ps auxwf
admin      631  0.0  0.4  21140  8976 ?        Ss   Jan25   0:00 /lib/systemd/systemd --user
admin      632  0.0  0.1 104852  2364 ?        S    Jan25   0:00  \_ (sd-pam)
root      1148  0.1  2.5 1333156 50724 ?       Sl   Jan25   5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk
root      1161  0.1  2.4 1311492 48800 ?       Sl   Jan25   4:36  \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272
root      1307  0.0  4.3 865924 87296 ?        Ssl  Jan25   0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2
root      2104  0.0  1.1  32536 22200 ?        Ss   Jan25   0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner
root@ip-10-253-254-49:/tmp# docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
root@ip-10-253-254-49:/tmp# v
total 4
drwxrwxrwt  4 root root   48 Jan 27 08:25 ./
drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../
drwx--x--x 14 root root  182 Jan 25 12:53 docker-data/
drwxr-xr-x  2 root root   18 Jan 25 12:54 hsperfdata_root/

The container (tordo-dz642-2l2xwswnvbvc8gk) is just one part of our standard test suite:

{
  "uuid": "tordo-dz642-2l2xwswnvbvc8gk",
  "owner_uuid": "tordo-tpzed-000000000000000",
  "created_at": "2022-01-25T12:53:29.978Z",
  "modified_at": "2022-01-27T12:58:31.556Z",
  "modified_by_client_uuid": "tordo-ozdt8-q6dzdi1lcc03155",
  "modified_by_user_uuid": "tordo-tpzed-000000000000000",
  "state": "Locked",
  "started_at": null,
  "finished_at": null,
  "log": "58d5e0dd2f63cc85d9146130dc10c54c+13282",
  "environment": {
    "HOME": "/var/spool/cwl",
    "TMPDIR": "/tmp" 
  },
  "cwd": "/var/spool/cwl",
  "command": [
    "/bin/sh",
    "-c",
    "echo \"HOME=$HOME\" \"TMPDIR=$TMPDIR\" && test \"$HOME\" = /var/spool/cwl -a \"$TMPDIR\" = /tmp" 
  ],
  "output_path": "/var/spool/cwl",
  "mounts": {
    "/tmp": {
      "capacity": 1073741824,
      "kind": "tmp" 
    },
    "/var/spool/cwl": {
      "capacity": 1073741824,
      "kind": "tmp" 
    }
  },
  "runtime_constraints": {
    "API": false,
    "cuda": {
      "device_count": 0,
      "driver_version": "",
      "hardware_capability": "" 
    },
    "keep_cache_ram": 268435456,
    "ram": 268435456,
    "vcpus": 1
  },
  "output": null,
  "container_image": "021e994505b006982494a7caf0cedd1d+261",
  "progress": null,
  "priority": 562948310306102004,
  "updated_at": null,
  "exit_code": null,
  "auth_uuid": "tordo-gj3su-rxkhzwtfimcmuij",
  "locked_by_uuid": "tordo-gj3su-000000000000000",
  "scheduling_parameters": {
    "max_run_time": 0,
    "partitions": [

    ],
    "preemptible": false
  },
  "runtime_status": {
  },
  "runtime_user_uuid": "ce8i5-tpzed-yzrv3k3xiq86td0",
  "runtime_auth_scopes": [
    "all" 
  ],
  "runtime_token": null,
  "lock_count": 1,
  "gateway_address": null,
  "interactive_session_started": false,
  "output_storage_classes": [
    "default" 
  ]
}

Versions:

root@ip-10-253-254-49:/usr/src# dpkg -l |grep arv
ii  arvados-docker-cleaner              2.3.0~dev20210729201354-1    amd64        Arvados Docker cleaner
ii  python3-arvados-fuse                2.4.0~dev20211126162134-1    amd64        Arvados FUSE driver
root@ip-10-253-254-49:/usr/src# dpkg -l |grep cru
ii  crunch-run                          2.4.0~dev20211230190012-1    amd64        Supervise a single Crunch container

Clearly we need to rebuild the image to get the latest arvados-docker-cleaner version but the fuse driver is up to date.

The arv-mount parent process and two of the (33!) children are stuck on FUTEX_WAIT:


# strace -f -p 1161
strace: Process 1161 attached with 34 threads
[pid  1275] read(5,  <unfinished ...>
[pid  1273] read(5,  <unfinished ...>
[pid  1272] read(5,  <unfinished ...>
[pid  1274] read(5,  <unfinished ...>
[pid  1271] read(5,  <unfinished ...>
[pid  1270] read(5,  <unfinished ...>
[pid  1269] read(5,  <unfinished ...>
[pid  1268] read(5,  <unfinished ...>
[pid  1263] read(5,  <unfinished ...>
[pid  1267] read(5,  <unfinished ...>
[pid  1266] read(5,  <unfinished ...>
[pid  1265] read(5,  <unfinished ...>
[pid  1264] read(5,  <unfinished ...>
[pid  1262] read(5,  <unfinished ...>
[pid  1261] read(5,  <unfinished ...>
[pid  1259] read(5,  <unfinished ...>
[pid  1260] read(5,  <unfinished ...>
[pid  1258] read(5,  <unfinished ...>
[pid  1257] read(5,  <unfinished ...>
[pid  1256] read(5,  <unfinished ...>
[pid  1250] read(5,  <unfinished ...>
[pid  1255] read(5,  <unfinished ...>
[pid  1254] read(5,  <unfinished ...>
[pid  1249] read(5,  <unfinished ...>
[pid  1248] read(5,  <unfinished ...>
[pid  1247] read(5,  <unfinished ...>
[pid  1245] read(5,  <unfinished ...>
[pid  1243] read(5,  <unfinished ...>
[pid  1244] read(5,  <unfinished ...>
[pid  1242] read(5,  <unfinished ...>
[pid  1241] futex(0x7ff9080013f0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid  1239] futex(0x7ff911b05170, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid  1232] select(0, NULL, NULL, NULL, {tv_sec=9, tv_usec=562958} <unfinished ...>
[pid  1161] futex(0x7ff904000d50, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

That's the C backtrace, not so useful. To get the python backtrace:

# apt install gdb python3-dbg
# cd /usr/share/python3/dist/python3-arvados-fuse
# . bin/activate
# gdb python 1161
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/99/21c75e6930d3e9d9fa8c942aca9dc4500bb65f.debug...done.
done.
Attaching to program: /usr/share/python3/dist/python3-arvados-fuse/bin/python, process 1161
[New LWP 1232]
[New LWP 1239]
[New LWP 1241]
[New LWP 1242]
[New LWP 1243]
[New LWP 1244]
[New LWP 1245]
[New LWP 1247]
[New LWP 1248]
[New LWP 1249]
[New LWP 1250]
[New LWP 1254]
[New LWP 1255]
[New LWP 1256]
[New LWP 1257]
[New LWP 1258]
[New LWP 1259]
[New LWP 1260]
[New LWP 1261]
[New LWP 1262]
[New LWP 1263]
[New LWP 1264]
[New LWP 1265]
[New LWP 1266]
[New LWP 1267]
[New LWP 1268]
[New LWP 1269]
[New LWP 1270]
[New LWP 1271]
[New LWP 1272]
[New LWP 1273]
[New LWP 1274]
[New LWP 1275]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7ff904000d50) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205    ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) source /usr/share/gdb/auto-load/usr/bin/python3.7-gdb.py

And then:

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/usr/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 387, in _run_standalone
    self.llfuse_thread.join(timeout=None)
  File "/usr/share/python3/dist/python3-arvados-fuse/lib/python3.7/site-packages/arvados_fuse/command.py", line 176, in run
    self._run_standalone()
  File "/usr/bin/arv-mount", line 10, in <module>
    arvados_fuse.command.Mount(args).run()

That backtrace does not seem super helpful, though.


Files

tordo-dz642-2l2xwswnvbvc8gk.logs.bz2 (2.32 MB) tordo-dz642-2l2xwswnvbvc8gk.logs.bz2 tordo-dz642-2l2xwswnvbvc8gk livelogs from database Ward Vandewege, 01/27/2022 10:06 PM
Actions #1

Updated by Ward Vandewege 5 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #3

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #4

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #5

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #6

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #7

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #8

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #9

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #10

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #11

Updated by Ward Vandewege 5 months ago

  • Description updated (diff)
Actions #12

Updated by Ward Vandewege 5 months ago

Hmm, the node magically got unstuck?

tordo:~# journalctl -u arvados-dispatch-cloud.service |grep i-06cce6b3e0448b10d
Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}
Jan 27 16:02:17 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"State":"unknown","level":"info","msg":"instance appeared in cloud","time":"2022-01-27T16:02:17.392979834Z"}
Jan 27 16:02:27 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-27T16:02:27.398813476Z"}
Jan 27 16:02:27 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~37e2c8689a3cc4ab12eb21a40a002169\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"37e2c8689a3cc4ab12eb21a40a002169","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~37e2c8689a3cc4ab12eb21a40a002169","time":"2022-01-27T16:02:27.407067063Z"}
Jan 27 16:02:27 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"ProbeStart":"2022-01-27T16:02:27.079485424Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-27T16:02:27.744023426Z"}
Jan 27 16:02:27 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"info","msg":"crunch-run process detected","time":"2022-01-27T16:02:27.771901499Z"}
Jan 27 16:02:27 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"ProbeStart":"2022-01-27T16:02:27.079485424Z","RunningContainers":1,"State":"running","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-27T16:02:27.772380385Z"}
Jan 27 16:02:27 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"Reason":"pool says running, but queue says state=Queued","level":"info","msg":"killing crunch-run process","time":"2022-01-27T16:02:27.824445564Z"}
Jan 27 16:02:32 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"Signal":15,"level":"info","msg":"sending signal","time":"2022-01-27T16:02:32.824569537Z"}
Jan 27 16:02:32 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"info","msg":"crunch-run process ended","time":"2022-01-27T16:02:32.903696589Z"}
Jan 27 16:02:37 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"info","msg":"probe reported stale run lock","time":"2022-01-27T16:02:37.112809161Z"}
Jan 27 16:02:38 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"error":"Process exited with status 1","level":"error","msg":"error starting crunch-run process","stderr":"lock /var/lock/crunch-run-tordo-dz642-2l2xwswnvbvc8gk.lock: resource temporarily unavailable\n","stdout":"","time":"2022-01-27T16:02:38.079297369Z"}
Jan 27 16:02:47 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"info","msg":"probe reported stale run lock","time":"2022-01-27T16:02:47.113177337Z"}
Jan 27 16:02:47 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","Duration":10000368046,"Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"warning","msg":"reporting broken after reporting stale run lock for too long","time":"2022-01-27T16:02:47.113231268Z"}
Jan 27 16:02:47 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"ProbeStart":"2022-01-27T16:02:47.078663327Z","level":"info","msg":"probe reported broken instance","time":"2022-01-27T16:02:47.113268338Z"}
Jan 27 16:02:47 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"drain","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"info","msg":"set idle behavior","time":"2022-01-27T16:02:47.113301169Z"}
Jan 27 16:02:47 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"level":"info","msg":"crunch-run process ended","time":"2022-01-27T16:02:47.113332329Z"}
Jan 27 16:02:57 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"drain","IdleDuration":9.965893,"Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":19165,"State":"idle","level":"info","msg":"shutdown worker","time":"2022-01-27T16:02:57.079279400Z"}
Jan 27 16:03:17 tordo.arvadosapi.com arvados-dispatch-cloud[19165]: {"ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","PID":19165,"WorkerState":"shutdown","level":"info","msg":"instance disappeared in cloud","time":"2022-01-27T16:03:17.681692165Z"}
Actions #13

Updated by Ward Vandewege 5 months ago

systemtap output for a ~10 second or so period while in this state:

root@ip-10-253-254-49:/usr/src# stap futexes.stp 
^Crs:main Q:Reg[564] lock 0x56082d5cb254 contended 56 times, 703166 avg us
rs:main Q:Reg[564] lock 0x56082d5cb250 contended 55 times, 375197 avg us
containerd[568] lock 0xc00005a950 contended 178 times, 178694 avg us
containerd[568] lock 0xc00005bd50 contended 209 times, 176802 avg us
containerd[568] lock 0x55b385a9fad8 contended 385 times, 168496 avg us
crunch-run~89be[1148] lock 0x262e0f8 contended 733 times, 87975 avg us
crunch-run~89be[1148] lock 0xc00010c950 contended 339 times, 110516 avg us
crunch-run~89be[1148] lock 0xc000990150 contended 183 times, 264216 avg us
crunch-run~89be[1148] lock 0x262cad0 contended 256 times, 188071 avg us
crunch-run~89be[1148] lock 0x262e118 contended 2 times, 18 avg us
crunch-run~89be[1148] lock 0x262dff8 contended 2 times, 22 avg us
crunch-run~89be[1148] lock 0x265c5a8 contended 1 times, 10 avg us
dockerd[1307] lock 0x555d7a7f8c60 contended 1 times, 10 avg us
dockerd[1307] lock 0x555d7a7fcb40 contended 6 times, 8333299 avg us
dockerd[1307] lock 0xc0003ce4c8 contended 8 times, 6247580 avg us
dockerd[1307] lock 0xc000170848 contended 6 times, 8330130 avg us
dockerd[1307] lock 0x555d7a7f8c70 contended 6 times, 8328538 avg us
dockerd[1307] lock 0x555d7a7f8c88 contended 1 times, 17 avg us
crunch-run~89be[21460] lock 0xc00005a950 contended 5 times, 1569 avg us
crunch-run~89be[21460] lock 0x265cdd8 contended 1 times, 13487 avg us
crunch-run~89be[21460] lock 0xc00005ad50 contended 15 times, 875 avg us
crunch-run~89be[21460] lock 0x262e118 contended 5 times, 98 avg us
crunch-run~89be[21460] lock 0x262e0f8 contended 1 times, 675 avg us
crunch-run~89be[21460] lock 0x262e0e8 contended 4 times, 140 avg us
crunch-run~89be[21460] lock 0x265dd88 contended 2 times, 21 avg us
crunch-run~89be[21460] lock 0x262cad0 contended 8 times, 105 avg us
crunch-run~89be[21460] lock 0x262dff8 contended 2 times, 2 avg us
crunch-run~89be[21468] lock 0x262e0e8 contended 1 times, 24 avg us
crunch-run~89be[21468] lock 0x265cdd8 contended 1 times, 8053 avg us
crunch-run~89be[21468] lock 0xc00005ad50 contended 11 times, 1056 avg us
crunch-run~89be[21468] lock 0xc000100150 contended 9 times, 1769 avg us
crunch-run~89be[21468] lock 0xc00005a950 contended 1 times, 20593 avg us
crunch-run~89be[21468] lock 0x262e118 contended 3 times, 95 avg us
crunch-run~89be[21468] lock 0x262dff8 contended 2 times, 0 avg us
crunch-run~89be[21468] lock 0x265dd88 contended 2 times, 12 avg us
crunch-run~89be[21468] lock 0x262cad0 contended 8 times, 237 avg us
crunch-run~89be[21475] lock 0x265dd88 contended 1 times, 237 avg us
crunch-run~89be[21475] lock 0x262e118 contended 4 times, 357 avg us
crunch-run~89be[21475] lock 0x262cad0 contended 10 times, 270 avg us
crunch-run~89be[21475] lock 0xc00005ad50 contended 15 times, 1594 avg us
crunch-run~89be[21475] lock 0x262e0e8 contended 2 times, 510 avg us
crunch-run~89be[21475] lock 0x262dff8 contended 5 times, 3 avg us
crunch-run~89be[21475] lock 0x265cdd8 contended 1 times, 25977 avg us
crunch-run~89be[21475] lock 0x2645700 contended 2 times, 1 avg us
crunch-run~89be[21475] lock 0xc00005a950 contended 7 times, 2149 avg us
crunch-run~89be[21475] lock 0x262e0f8 contended 1 times, 433 avg us
crunch-run~89be[21481] lock 0x265cdd8 contended 1 times, 30258 avg us
crunch-run~89be[21481] lock 0x262e118 contended 8 times, 250 avg us
crunch-run~89be[21481] lock 0xc00005a950 contended 13 times, 2034 avg us
crunch-run~89be[21481] lock 0xc00005ad50 contended 8 times, 2656 avg us
crunch-run~89be[21481] lock 0x262e0e8 contended 2 times, 24 avg us
crunch-run~89be[21481] lock 0x262dff8 contended 4 times, 4 avg us
crunch-run~89be[21481] lock 0x262cad0 contended 7 times, 278 avg us
crunch-run~89be[21481] lock 0x265dd88 contended 1 times, 26 avg us
crunch-run~89be[21490] lock 0x262e0e8 contended 2 times, 14 avg us
crunch-run~89be[21490] lock 0x262cad0 contended 8 times, 117 avg us
crunch-run~89be[21490] lock 0x262dff8 contended 2 times, 17 avg us
crunch-run~89be[21490] lock 0x265dd88 contended 1 times, 20 avg us
crunch-run~89be[21490] lock 0xc00005ad50 contended 14 times, 914 avg us
crunch-run~89be[21490] lock 0xc00005a950 contended 9 times, 976 avg us
crunch-run~89be[21490] lock 0x262e118 contended 4 times, 116 avg us
crunch-run~89be[21490] lock 0x265cdd8 contended 1 times, 13479 avg us
crunch-run~89be[21499] lock 0xc00005ad50 contended 7 times, 1353 avg us
crunch-run~89be[21499] lock 0x262dff8 contended 2 times, 1 avg us
crunch-run~89be[21499] lock 0x262cad0 contended 9 times, 73 avg us
crunch-run~89be[21499] lock 0x265dd88 contended 1 times, 125 avg us
crunch-run~89be[21499] lock 0x262e118 contended 4 times, 89 avg us
crunch-run~89be[21499] lock 0xc00005a950 contended 15 times, 959 avg us
crunch-run~89be[21499] lock 0x262e0e8 contended 1 times, 23 avg us
crunch-run~89be[21499] lock 0x265cdd8 contended 1 times, 14126 avg us
crunch-run~89be[21509] lock 0xc00005ad50 contended 6 times, 1179 avg us
crunch-run~89be[21509] lock 0x262cad0 contended 6 times, 1307 avg us
crunch-run~89be[21509] lock 0x265cdd8 contended 1 times, 12077 avg us
crunch-run~89be[21509] lock 0x262e118 contended 4 times, 25 avg us
crunch-run~89be[21509] lock 0x262dff8 contended 3 times, 8 avg us
crunch-run~89be[21509] lock 0xc0001a6150 contended 9 times, 1107 avg us
crunch-run~89be[21509] lock 0x262e0e8 contended 1 times, 19 avg us
crunch-run~89be[21509] lock 0xc00005a950 contended 6 times, 2006 avg us
WARNING: Number of errors: 0, skipped probes: 28
Actions #14

Updated by Ward Vandewege 5 months ago

todo:
  • get container live logs from the database, what is in the container log collection is only for the successful run from today

-> added as tordo-dz642-2l2xwswnvbvc8gk.logs.bz2

  • verify that the tordo image does not have the bug from #18027 (how old is it?), boot a node and look at the cloud init script.

> the image is from 2021-10-11
> as part of #18027 we didn't actually update the BootProbeCommand on tordo, and the `/usr/local/bin/ensure-encrypted-partitions.sh` script was also not changed.

  • [DONE] we should switch the tordo config.yml to the default BootProbeCommand, this cluster actually runs singularity so the `sudo docker ps -q` command we have is not really relevant (even though it works because of the way our images are set up).
Actions #15

Updated by Tom Clegg 5 months ago

The disappearing temp directory looks very similar to #18027#note-9 ... suggest
  • change BootProbeCommand to "systemctl is-system-running" (this is the current default)
  • check the boot logs on a tordo compute instance for a pattern like this, where docker is running briefly before /tmp is mounted:
    Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: docker.service: Succeeded.
    Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: Stopped Docker Application Container Engine.
    ...
    Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + mount -o async /dev/mapper/tmp /tmp
    ...
    Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + service docker start
    Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Socket for the API.
    Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Listening on Docker Socket for the API.
    Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Application Container Engine...
    
Actions #16

Updated by Ward Vandewege 5 months ago

I think what happened here is described in https://dev.arvados.org/issues/18027#note-19. Changing the BootProbeCommand to the default (not docker ps -q) should avoid a recurrence of this problem.

Actions #17

Updated by Ward Vandewege 5 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF