Bug #18686
Updated by Ward Vandewege about 3 years ago
This is on Tordo (AWS). A compute node has been running for over 2 days:
<pre>
root@ip-10-253-254-49:/home/admin# uptime
13:05:39 up 2 days, 12 min, 1 user, load average: 0.02, 0.03, 0.00
</pre>
The AWS ID is i-06cce6b3e0448b10d at 10.253.254.49.
The a-d-c logs say:
<pre>
tordo:~# journalctl -u arvados-dispatch-cloud.service -n100000|grep i-06cce6b3e0448b10d
Jan 25 12:53:02 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","IdleBehavior":"run","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"State":"booting","level":"info","msg":"instance appeared in cloud","time":"2022-01-25T12:53:02.659570016Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Command":"sudo docker ps -q","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2022-01-25T12:53:38.438366587Z"}
Jan 25 12:53:38 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"cmd":"sudo sh -c 'set -e; dstdir=\"/var/lib/arvados/\"; dstfile=\"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2\"; mkdir -p \"$dstdir\"; touch \"$dstfile\"; chmod 0755 \"$dstdir\" \"$dstfile\"; cat \u003e\"$dstfile\"'","hash":"89bea761309098144f11941ae52673f2","level":"info","msg":"installing runner binary on worker","path":"/var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2","time":"2022-01-25T12:53:38.471727831Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","level":"info","msg":"instance booted; will try probeRunning","time":"2022-01-25T12:53:39.371129567Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"ProbeStart":"2022-01-25T12:53:30.678564128Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2022-01-25T12:53:39.429678726Z"}
Jan 25 12:53:39 tordo.arvadosapi.com arvados-dispatch-cloud[15971]: {"Address":"10.253.254.49","ClusterID":"tordo","ContainerUUID":"tordo-dz642-2l2xwswnvbvc8gk","Instance":"i-06cce6b3e0448b10d","InstanceType":"t3small","PID":15971,"level":"info","msg":"crunch-run process started","time":"2022-01-25T12:53:39.471920534Z"}
</pre>
On the node itself, @crunch-run@ and @arv-mount@ are running but nothing more:
<pre>
root 18119 0.0 0.1 12512 3212 pts/0 R+ 13:07 0:00 \_ ps auxwf
admin 631 0.0 0.4 21140 8976 ? Ss Jan25 0:00 /lib/systemd/systemd --user
admin 632 0.0 0.1 104852 2364 ? S Jan25 0:00 \_ (sd-pam)
root 1148 0.1 2.5 1333156 50724 ? Sl Jan25 5:24 /var/lib/arvados/crunch-run~89bea761309098144f11941ae52673f2 -no-detach --detach --stdin-config --runtime-engine=singularity tordo-dz642-2l2xwswnvbvc8gk
root 1161 0.1 2.4 1311492 48800 ? Sl Jan25 4:36 \_ /usr/share/python3/dist/python3-arvados-fuse/bin/python /usr/bin/arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.tordo-dz642-2l2xwswnvbvc8gk.4167237599/keep3068660272
root 1307 0.0 4.3 865924 87296 ? Ssl Jan25 0:37 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=10000:10000 --dns 10.253.0.2
root 2104 0.0 1.1 32536 22200 ? Ss Jan25 0:00 /usr/share/python3/dist/arvados-docker-cleaner/bin/python /usr/bin/arvados-docker-cleaner
</pre>
<pre>
root@ip-10-253-254-49:/tmp# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
</pre>
<pre>
root@ip-10-253-254-49:/tmp# v
total 4
drwxrwxrwt 4 root root 48 Jan 27 08:25 ./
drwxr-xr-x 18 root root 4096 Jan 25 12:53 ../
drwx--x--x 14 root root 182 Jan 25 12:53 docker-data/
drwxr-xr-x 2 root root 18 Jan 25 12:54 hsperfdata_root/
</pre>
The container (tordo-dz642-2l2xwswnvbvc8gk) is just one part of our standard test suite:
<pre>
{
"uuid": "tordo-dz642-2l2xwswnvbvc8gk",
"owner_uuid": "tordo-tpzed-000000000000000",
"created_at": "2022-01-25T12:53:29.978Z",
"modified_at": "2022-01-27T12:58:31.556Z",
"modified_by_client_uuid": "tordo-ozdt8-q6dzdi1lcc03155",
"modified_by_user_uuid": "tordo-tpzed-000000000000000",
"state": "Locked",
"started_at": null,
"finished_at": null,
"log": "58d5e0dd2f63cc85d9146130dc10c54c+13282",
"environment": {
"HOME": "/var/spool/cwl",
"TMPDIR": "/tmp"
},
"cwd": "/var/spool/cwl",
"command": [
"/bin/sh",
"-c",
"echo \"HOME=$HOME\" \"TMPDIR=$TMPDIR\" && test \"$HOME\" = /var/spool/cwl -a \"$TMPDIR\" = /tmp"
],
"output_path": "/var/spool/cwl",
"mounts": {
"/tmp": {
"capacity": 1073741824,
"kind": "tmp"
},
"/var/spool/cwl": {
"capacity": 1073741824,
"kind": "tmp"
}
},
"runtime_constraints": {
"API": false,
"cuda": {
"device_count": 0,
"driver_version": "",
"hardware_capability": ""
},
"keep_cache_ram": 268435456,
"ram": 268435456,
"vcpus": 1
},
"output": null,
"container_image": "021e994505b006982494a7caf0cedd1d+261",
"progress": null,
"priority": 562948310306102004,
"updated_at": null,
"exit_code": null,
"auth_uuid": "tordo-gj3su-rxkhzwtfimcmuij",
"locked_by_uuid": "tordo-gj3su-000000000000000",
"scheduling_parameters": {
"max_run_time": 0,
"partitions": [
],
"preemptible": false
},
"runtime_status": {
},
"runtime_user_uuid": "ce8i5-tpzed-yzrv3k3xiq86td0",
"runtime_auth_scopes": [
"all"
],
"runtime_token": null,
"lock_count": 1,
"gateway_address": null,
"interactive_session_started": false,
"output_storage_classes": [
"default"
]
}
</pre>
Versions:
<pre>
root@ip-10-253-254-49:/usr/src# dpkg -l |grep arv
ii arvados-docker-cleaner 2.3.0~dev20210729201354-1 amd64 Arvados Docker cleaner
ii python3-arvados-fuse 2.4.0~dev20211126162134-1 amd64 Arvados FUSE driver
root@ip-10-253-254-49:/usr/src# dpkg -l |grep cru
ii crunch-run 2.4.0~dev20211230190012-1 amd64 Supervise a single Crunch container
</pre>
Clearly we need to rebuild the image to get the latest arvados-docker-cleaner version but the fuse driver is up to date.