Bug #18027 (closed)

[crunch-run] upload failed (singularity)

Added by Ward Vandewege over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

From https://dev.arvados.org/issues/17755#note-27:
The 3rd failure is puzzling: if you check stderr, the container actually did run to the end, but the upload failed. This happened after the earlier intervention, where the container was stuck in the queued state and the compute node was killed.

2021-08-07T00:08:37.343893355Z Executing container 'tordo-dz642-xm3fbl5he5o2aul'
...
2021-08-07T00:08:39.823076366Z Waiting for container to finish
2021-08-07T00:10:47.551657400Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-xm3fbl5he5o2aul.609138299/tmp536330368: no such file or directory
2021-08-07T00:10:47.568368336Z Cancelled


Subtasks 2 (0 open, 2 closed)

Task #18035: Review 18027-unmount-fuse (Resolved, Tom Clegg, 09/03/2021)
Task #18119: Review 18027-boot-probe-command (Resolved, Tom Clegg, 09/03/2021)

Related issues 2 (0 open, 2 closed)

Related to Arvados - Idea #17755: Test singularity support on a cloud cluster by running some real workflows (Resolved, Ward Vandewege, 09/03/2021)
Related to Arvados - Bug #18026: [singularity] crunch-run image cache race condition (Resolved, Peter Amstutz, 08/24/2021)
#1

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
#2

Updated by Ward Vandewege over 3 years ago

  • Related to Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
#3

Updated by Peter Amstutz over 3 years ago

  • Category set to Crunch
#4

Updated by Peter Amstutz over 3 years ago

  • Assigned To set to Peter Amstutz
#5

Updated by Peter Amstutz over 3 years ago

  • Related to Bug #18026: [singularity] crunch-run image cache race condition added
#6

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-09-01 sprint to 2021-09-15 sprint
#7

Updated by Peter Amstutz over 3 years ago

  • Assigned To deleted (Peter Amstutz)
#8

Updated by Peter Amstutz over 3 years ago

  • Assigned To set to Tom Clegg
#9

Updated by Tom Clegg over 3 years ago

This is a tordo image / configuration bug.

I ran a "sleep 1800" container and logged in to the node to see journalctl.

arvados-dispatch-cloud uses "sudo docker ps -q" as a BootProbeCommand, but the node isn't really ready yet when docker starts. During the boot process docker appears to get restarted (which is bound to cause trouble), and the encrypted filesystem is mounted over /tmp, which causes exactly the problem we see in this ticket: crunch-run writes some files in /tmp only to have them mysteriously disappear.

Sep 02 20:52:27 ip-10-253-254-10 cloud-init[628]: + cryptsetup --key-file /var/tmp/key-WvBb7fhA.tmp luksOpen /dev/mapper/compute-tmp tmp
...
Sep 02 20:52:31 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:31.976846994Z Fetching Docker image from collection '4ad7d11381df349e464694762db14e04+303'
Sep 02 20:52:32 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:32.017126993Z Using Docker image id "sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002" 
Sep 02 20:52:32 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:32.017155204Z Loading Docker image from keep
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: docker.service: Succeeded.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: Stopped Docker Application Container Engine.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: docker.socket: Succeeded.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: Closed Docker Socket for the API.
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + ensure_umount /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + findmntq /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + findmnt /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + MOUNTOPTIONS=async
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + mount -o async /dev/mapper/tmp /tmp
Sep 02 20:52:33 ip-10-253-254-10 sh[585]: Traceback (most recent call last):
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/lib/python3.7/http/client.py", line 558, in _get_chunk_left
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     chunk_left = self._read_next_chunk_size()
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/lib/python3.7/http/client.py", line 525, in _read_next_chunk_size
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     return int(line, 16)
...
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     raise ProtocolError("Connection broken: %r" % e, e)
Sep 02 20:52:33 ip-10-253-254-10 sh[585]: urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: arvados-docker-cleaner.service: Main process exited, code=exited, status=1/FAILURE
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: arvados-docker-cleaner.service: Failed with result 'exit-code'.
Sep 02 20:52:33 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:33.192786081Z building singularity image
Sep 02 20:52:33 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:33.250629814Z [singularity build /tmp/crunch-run.tordo-dz642-wd0nj7l9cjxoi7o.800734095/keep912285602/by_uuid/tordo-4zz18-gi4qh4ynirsg9vo/image.sif docker-archive:///tmp/crunch-run-sing
Sep 02 20:52:33 ip-10-253-254-10 kernel: SGI XFS with ACLs, security attributes, realtime, no debug enabled
Sep 02 20:52:33 ip-10-253-254-10 kernel: XFS (dm-1): Mounting V5 Filesystem
Sep 02 20:52:33 ip-10-253-254-10 kernel: XFS (dm-1): Ending clean mount
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + chmod a+w,+t /tmp
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + cat
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + '[' -d /etc/sv/docker.io ']'
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + service docker start
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Socket for the API.
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Listening on Docker Socket for the API.
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Application Container Engine...
...
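
The failure mode described above is easy to reproduce in isolation. Here is a minimal sketch (not Arvados code; it assumes root and uses a scratch directory rather than the real /tmp): files created under a directory become invisible as soon as another filesystem is mounted over it, which is exactly what happened to crunch-run's work directory when the boot script mounted the encrypted filesystem over /tmp.

import os
import subprocess
import tempfile

scratch = tempfile.mkdtemp()                      # stands in for /tmp
work = tempfile.mkdtemp(dir=scratch)              # stands in for crunch-run's work dir
open(os.path.join(work, "marker"), "w").close()

# Equivalent of the boot script's step: mount a new filesystem over the directory.
subprocess.run(["mount", "-t", "tmpfs", "tmpfs", scratch], check=True)

# lstat on the old path now fails, just like the CaptureOutput error in this ticket.
print(os.path.lexists(os.path.join(work, "marker")))    # False

subprocess.run(["umount", scratch], check=True)          # the original files reappear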
#10

Updated by Tom Clegg over 3 years ago

  • Status changed from New to In Progress
#11

Updated by Tom Clegg over 3 years ago

While investigating, I found a couple of other things to fix:
  • 22102137c 18027: Kill arv-mount directly if arv-mount --unmount cannot run.
    • Not sure how often this happens, but if we can't even start "arv-mount --unmount" (arv-mount command has disappeared from PATH, cannot fork, etc) it seems like we should at least send SIGKILL to our arv-mount process before giving up.
  • 9ce401977 18027: Fix crash when reporting error during --unmount.
    • Python 3 error: sys.stderr.write(somebytes) crashes, so we never get to see the fusermount error message. This happened in the tordo container; presumably the error was that the mount point didn't exist (see above).

18027-unmount-fuse @ 22102137c1e5eda8543c79d30d963ceb9e9a47af -- developer-run-tests: #2671
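
A minimal sketch of the Python 3 problem behind the second fix (illustrative only, not arv-mount's actual code; it assumes fusermount is installed): the fusermount output comes back from subprocess as bytes, and sys.stderr.write() only accepts str, so the attempt to report the error crashes before the message is printed.

import subprocess
import sys

# Run fusermount against a bogus mount point and capture its complaint as bytes.
proc = subprocess.run(["fusermount", "-u", "-z", "/nonexistent/mountpoint"],
                      stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

# Broken: raises "TypeError: write() argument must be str, not bytes",
# so the real fusermount error message is never shown.
#   sys.stderr.write(proc.stdout)

# Fixed: decode first (or write to sys.stderr.buffer) so the message gets through.
sys.stderr.write(proc.stdout.decode(errors="replace"))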

#12

Updated by Peter Amstutz over 3 years ago

Tom Clegg wrote:

While investigating, I found a couple of other things to fix:
  • 22102137c 18027: Kill arv-mount directly if arv-mount --unmount cannot run.
    • Not sure how often this happens, but if we can't even start "arv-mount --unmount" (arv-mount command has disappeared from PATH, cannot fork, etc) it seems like we should at least send SIGKILL to our arv-mount process before giving up.

Probably related to --unmount not working: I noticed that even though it was intended to operate in a way that wouldn't get stuck in the "dead mount" tar pit, it actually still does. Maybe we should try again to fix that?

  • 9ce401977 18027: Fix crash when reporting error during --unmount.
    • Python 3 error: sys.stderr.write(somebytes) crashes, so we never get to see the fusermount error message. This happened in the tordo container; presumably the error was that the mount point didn't exist (see above).

Great, yeah, I've seen that happen a few times; it's embarrassing.

18027-unmount-fuse @ 22102137c1e5eda8543c79d30d963ceb9e9a47af -- developer-run-tests: #2671

This branch LGTM

#13

Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved
#14

Updated by Tom Clegg over 3 years ago

Changed default boot probe command to "systemctl is-system-running".

Considered adding "--wait", but
  • it isn't supported in somewhat older systemd versions, and it would be super annoying to have a default boot probe that always fails on them.
  • --wait doesn't seem to offer a timeout option, which means there's no feedback to report if we reach the boot timeout while waiting for it. On the other hand, without --wait, in the event of a stuck boot script we get a systemd state like "starting" or "degraded" included with a-d-c's "boot failed" log entry.
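
To make the tradeoff concrete, here is a rough sketch of the probe loop (illustrative only: arvados-dispatch-cloud itself is written in Go, and run_probe is a hypothetical "run this command on the node" helper). Without --wait, each probe returns immediately, so if the boot timeout is reached the last reported systemd state can be included in the "boot failed" log entry.

import time

def wait_for_boot(run_probe, timeout=600, interval=10):
    # run_probe(cmd) is assumed to return (exit_code, output) from the compute node.
    deadline = time.monotonic() + timeout
    state = "unknown"
    while time.monotonic() < deadline:
        exit_code, output = run_probe("systemctl is-system-running")
        state = output.strip() or state
        if exit_code == 0:           # exit status 0 means the reported state is "running"
            return state
        time.sleep(interval)
    raise TimeoutError("boot probe timed out; last systemd state: " + state)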

18027-boot-probe-command @ b169b40808a571663b695d638fb2915625624734 -- developer-run-tests: #2683

#15

Updated by Tom Clegg over 3 years ago

  • Status changed from Resolved to In Progress
#16

Updated by Ward Vandewege over 3 years ago

Tom Clegg wrote:

Changed default boot probe command to "systemctl is-system-running".

Considered adding "--wait", but
  • it isn't supported in somewhat older systemd versions, and it would be super annoying to have a default boot probe that always fails on them.
  • --wait doesn't seem to offer a timeout option, which means there's no feedback to report if we reach the boot timeout while waiting for it. On the other hand, without --wait, in the event of a stuck boot script we get a systemd state like "starting" or "degraded" included with a-d-c's "boot failed" log entry.

18027-boot-probe-command @ b169b40808a571663b695d638fb2915625624734 -- developer-run-tests: #2683

I'm trying to get a successful run of the developer-run-tests-apps-workbench-integration CI job at developer-run-tests-apps-workbench-integration: #2849; it seems hard to imagine the failures are real/related, but I'm just making sure.

Otherwise, LGTM, thanks.

#17

Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|f57467be5f17a0894b213f632af9af53b6e57bdc.

#18

Updated by Peter Amstutz about 3 years ago

  • Release set to 42
#19

Updated by Ward Vandewege almost 3 years ago

Tom Clegg wrote:

arvados-dispatch-cloud uses "sudo docker ps -q" as a BootProbeCommand, but the node isn't really ready yet when docker starts. During the boot process docker appears to get restarted (which is bound to cause trouble)

This isn't entirely accurate - the compute image comes up with docker stopped. Cf. https://github.com/arvados/arvados/blob/c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42/tools/compute-images/scripts/base.sh#L123

So it's not that the /usr/local/bin/ensure-encrypted-partitions.sh script restarts docker; rather, it starts docker after it has done its thing. Cf. https://github.com/arvados/arvados/blob/c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42/tools/compute-images/scripts/usr-local-bin-ensure-encrypted-partitions.sh#L141 (though, admittedly, the comment there is misleading).

That said, booting with a stopped docker is probably not enough. Consider what happens when docker is stopped via systemd:

# systemctl stop docker
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket

And sure enough, any execution of the docker command appears to restart it:

# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2022-01-27 21:59:41 UTC; 4s ago
     Docs: https://docs.docker.com
 Main PID: 25704 (dockerd)
    Tasks: 7
   CGroup: /system.slice/docker.service
           └─25704 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

It seems likely, then, that the mere execution of the BootProbeCommand (configured to sudo docker ps -q) on behalf of a-d-c would have started docker again, causing the problem observed in this ticket (and again, in #18686).
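
As an aside, here is a sketch of what a boot script would have to do to keep docker from being socket-activated mid-boot (an assumption about the fix direction, not the actual Arvados compute-image script): the socket unit has to be stopped along with the service, otherwise any docker CLI invocation, including the boot probe, starts dockerd again.

import subprocess

# Stop both units; stopping docker.service alone leaves docker.socket listening,
# and the next "docker ps" re-activates the daemon.
subprocess.run(["systemctl", "stop", "docker.socket", "docker.service"], check=True)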
