Bug #18027
[crunch-run] upload failed (singularity)
Status: Closed
Added by Ward Vandewege over 3 years ago. Updated almost 3 years ago.
Description
From https://dev.arvados.org/issues/17755#note-27
===================================================================================================
The 3rd failure is puzzling: if you check stderr, it actually ran to the end but failed the upload. This happened after the earlier intervention, when the container was stuck in the queued state and the compute node was killed.
2021-08-07T00:08:37.343893355Z Executing container 'tordo-dz642-xm3fbl5he5o2aul'
...
2021-08-07T00:08:39.823076366Z Waiting for container to finish
2021-08-07T00:10:47.551657400Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-xm3fbl5he5o2aul.609138299/tmp536330368: no such file or directory
2021-08-07T00:10:47.568368336Z Cancelled
Updated by Ward Vandewege over 3 years ago
- Related to Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Updated by Peter Amstutz over 3 years ago
- Related to Bug #18026: [singularity] crunch-run image cache race condition added
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-09-01 sprint to 2021-09-15 sprint
Updated by Tom Clegg over 3 years ago
This is a tordo image / configuration bug.
I ran a "sleep 1800" container and logged in to the node to see journalctl.
arvados-dispatch-cloud uses "sudo docker ps -q" as a BootProbeCommand, but the node isn't really ready yet when docker starts. During the boot process docker appears to get restarted (which is bound to cause trouble) and the encrypted filesystem is mounted at /tmp (which will cause exactly the problem we see in this ticket, where crunch-run writes some files in /tmp only to have them mysteriously disappear).
Sep 02 20:52:27 ip-10-253-254-10 cloud-init[628]: + cryptsetup --key-file /var/tmp/key-WvBb7fhA.tmp luksOpen /dev/mapper/compute-tmp tmp
...
Sep 02 20:52:31 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:31.976846994Z Fetching Docker image from collection '4ad7d11381df349e464694762db14e04+303'
Sep 02 20:52:32 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:32.017126993Z Using Docker image id "sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002"
Sep 02 20:52:32 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:32.017155204Z Loading Docker image from keep
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: docker.service: Succeeded.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: Stopped Docker Application Container Engine.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: docker.socket: Succeeded.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: Closed Docker Socket for the API.
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + ensure_umount /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + findmntq /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + findmnt /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + MOUNTOPTIONS=async
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + mount -o async /dev/mapper/tmp /tmp
Sep 02 20:52:33 ip-10-253-254-10 sh[585]: Traceback (most recent call last):
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/lib/python3.7/http/client.py", line 558, in _get_chunk_left
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     chunk_left = self._read_next_chunk_size()
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/lib/python3.7/http/client.py", line 525, in _read_next_chunk_size
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     return int(line, 16)
...
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     raise ProtocolError("Connection broken: %r" % e, e)
Sep 02 20:52:33 ip-10-253-254-10 sh[585]: urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: arvados-docker-cleaner.service: Main process exited, code=exited, status=1/FAILURE
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: arvados-docker-cleaner.service: Failed with result 'exit-code'.
Sep 02 20:52:33 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:33.192786081Z building singularity image
Sep 02 20:52:33 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:33.250629814Z [singularity build /tmp/crunch-run.tordo-dz642-wd0nj7l9cjxoi7o.800734095/keep912285602/by_uuid/tordo-4zz18-gi4qh4ynirsg9vo/image.sif docker-archive:///tmp/crunch-run-sing
Sep 02 20:52:33 ip-10-253-254-10 kernel: SGI XFS with ACLs, security attributes, realtime, no debug enabled
Sep 02 20:52:33 ip-10-253-254-10 kernel: XFS (dm-1): Mounting V5 Filesystem
Sep 02 20:52:33 ip-10-253-254-10 kernel: XFS (dm-1): Ending clean mount
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + chmod a+w,+t /tmp
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + cat
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + '[' -d /etc/sv/docker.io ']'
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + service docker start
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Socket for the API.
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Listening on Docker Socket for the API.
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Application Container Engine...
...
Updated by Tom Clegg over 3 years ago
While investigating I found a couple of other things to fix:
- 22102137c 18027: Kill arv-mount directly if arv-mount --unmount cannot run.
  - Not sure how often this happens, but if we can't even start "arv-mount --unmount" (arv-mount command has disappeared from PATH, cannot fork, etc) it seems like we should at least send SIGKILL to our arv-mount process before giving up.
- 9ce401977 18027: Fix crash when reporting error during --unmount.
  - Python3 error... sys.stderr.write(somebytes) crashes, so we never get to see the fusermount error message. This happened in the tordo container -- presumably the error was that the mount point didn't exist (see above).
18027-unmount-fuse @ 22102137c1e5eda8543c79d30d963ceb9e9a47af -- developer-run-tests: #2671
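For illustration only, a rough Python sketch of the logic behind those two commits (the helper and its arguments here are hypothetical, not the actual code from the branch):

import signal
import subprocess
import sys

def cleanup_arv_mount(mount_point, arv_mount_proc):
    # Hypothetical helper sketching the intent of the two fixes above.
    try:
        # Preferred path: ask arv-mount to unmount the FUSE mount cleanly.
        result = subprocess.run(["arv-mount", "--unmount", mount_point],
                                capture_output=True)
    except OSError as exc:
        # 22102137c: if "arv-mount --unmount" cannot even be started
        # (command gone from PATH, cannot fork, ...), at least send
        # SIGKILL to our own arv-mount process before giving up.
        print("cannot run arv-mount --unmount: %s; killing arv-mount" % exc,
              file=sys.stderr)
        arv_mount_proc.send_signal(signal.SIGKILL)
        return
    if result.returncode != 0:
        # 9ce401977: result.stderr is bytes; sys.stderr.write(bytes) raises
        # TypeError on Python 3 and hides the real fusermount error, so
        # decode first (or write to sys.stderr.buffer).
        sys.stderr.write(result.stderr.decode(errors="replace"))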
Updated by Peter Amstutz over 3 years ago
Tom Clegg wrote:
While investigating I found a couple of other things to fix
- 22102137c 18027: Kill arv-mount directly if arv-mount --unmount cannot run.
- Not sure how often this happens, but if we can't even start "arv-mount --unmount" (arv-mount command has disappeared from PATH, cannot fork, etc) it seems like we should at least send SIGKILL to our arv-mount process before giving up.
Probably related to --unmount not working: I noticed that even though --unmount was designed so that it wouldn't get stuck on a dead mount, it actually still does. Maybe we should try again to fix that?
- 9ce401977 18027: Fix crash when reporting error during --unmount.
- Python3 error... sys.stderr.write(somebytes) crashes, so we never get to see the fusermount error message. This happened in the tordo container -- presumably the error was that the mount point didn't exist (see above).
Great, yeah, I've seen that happen a few times; it's embarrassing.
18027-unmount-fuse @ 22102137c1e5eda8543c79d30d963ceb9e9a47af -- developer-run-tests: #2671
This branch LGTM
Updated by Tom Clegg over 3 years ago
- Status changed from In Progress to Resolved
Updated by Tom Clegg over 3 years ago
Changed default boot probe command to "systemctl is-system-running".
Considered adding "--wait", but:
- it isn't supported in medium-old systemd versions, and it would be super annoying to have a default boot probe that always fails.
- --wait doesn't seem to offer a timeout option, which means there's no feedback to report if we reach boot timeout while waiting for it. OTOH, without the --wait option, in the event of a stuck boot script we get a systemd state like "starting" or "degraded" included with a-d-c's "boot failed" log entry.
18027-boot-probe-command @ b169b40808a571663b695d638fb2915625624734 -- developer-run-tests: #2683
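For reference, the boot probe is configurable per cluster; a sketch of the setting, assuming the usual Clusters.<ClusterID>.Containers.CloudVMs layout of the Arvados cluster config:

Clusters:
  tordo:
    Containers:
      CloudVMs:
        # Previous value, which can pass before the node is actually ready:
        #   BootProbeCommand: "sudo docker ps -q"
        # New default; a stuck boot script shows up as "starting"/"degraded"
        # in a-d-c's "boot failed" log entry:
        BootProbeCommand: "systemctl is-system-running"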
Updated by Tom Clegg over 3 years ago
- Status changed from Resolved to In Progress
Updated by Ward Vandewege over 3 years ago
Tom Clegg wrote:
Changed default boot probe command to "systemctl is-system-running".
Considered adding "--wait", but:
- it isn't supported in medium-old systemd versions, and it would be super annoying to have a default boot probe that always fails.
- --wait doesn't seem to offer a timeout option, which means there's no feedback to report if we reach boot timeout while waiting for it. OTOH, without the --wait option, in the event of a stuck boot script we get a systemd state like "starting" or "degraded" included with a-d-c's "boot failed" log entry.
18027-boot-probe-command @ b169b40808a571663b695d638fb2915625624734 -- developer-run-tests: #2683
I'm trying to get a successful run of the developer-run-tests-apps-workbench-integration CI job at developer-run-tests-apps-workbench-integration: #2849; it seems hard to imagine the failures are real/related, but I'm just making sure.
Otherwise, LGTM, thanks.
Updated by Tom Clegg over 3 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados-private:commit:arvados|f57467be5f17a0894b213f632af9af53b6e57bdc.
Updated by Ward Vandewege almost 3 years ago
Tom Clegg wrote:
arvados-dispatch-cloud uses "sudo docker ps -q" as a BootProbeCommand, but the node isn't really ready yet when docker starts. During the boot process docker appears to get restarted (which is bound to cause trouble)
This isn't entirely accurate: the compute image comes up with docker stopped. Cf. https://github.com/arvados/arvados/blob/c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42/tools/compute-images/scripts/base.sh#L123
So it's not that the /usr/local/bin/ensure-encrypted-partitions.sh script restarts docker; it starts docker after it has done its thing. Cf. https://github.com/arvados/arvados/blob/c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42/tools/compute-images/scripts/usr-local-bin-ensure-encrypted-partitions.sh#L141 (though, admittedly, the comment there is misleading).
That said, booting with a stopped docker is probably not enough. Consider what happens when docker is stopped via systemd:
# systemctl stop docker
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
And sure enough, any execution of the docker command appears to restart it:
# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2022-01-27 21:59:41 UTC; 4s ago
     Docs: https://docs.docker.com
 Main PID: 25704 (dockerd)
    Tasks: 7
   CGroup: /system.slice/docker.service
           └─25704 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
It seems likely, then, that the mere execution of the BootProbeCommand (configured to sudo docker ps -q) on behalf of a-d-c would have started docker again, causing the problem observed in this ticket (and again, in #18686).
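For what it's worth, the re-activation doesn't even require the docker CLI; a minimal sketch (hypothetical, run as root on a node where docker.socket is enabled) showing that merely connecting to the API socket is enough for systemd to start docker.service again:

import socket

# docker.socket owns /var/run/docker.sock, so systemd starts docker.service
# as soon as anything connects -- even right after "systemctl stop docker".
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect("/var/run/docker.sock")
s.close()
# "systemctl status docker" should now show the service running again.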