Bug #18027

[crunch-run] upload failed (singularity)

Added by Ward Vandewege 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
09/03/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

From https://dev.arvados.org/issues/17755#note-27:
The 3rd failure is puzzling: if you check stderr, the container actually ran to the end, but the upload failed. This happened after the intervention for being stuck in the queued state, where we killed the compute node.

2021-08-07T00:08:37.343893355Z Executing container 'tordo-dz642-xm3fbl5he5o2aul'
...
2021-08-07T00:08:39.823076366Z Waiting for container to finish
2021-08-07T00:10:47.551657400Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-xm3fbl5he5o2aul.609138299/tmp536330368: no such file or directory
2021-08-07T00:10:47.568368336Z Cancelled


Subtasks

Task #18035: Review 18027-unmount-fuse (Resolved, Tom Clegg)

Task #18119: Review 18027-boot-probe-command (Resolved, Tom Clegg)


Related issues

Related to Arvados - Story #17755: Test singularity support on a cloud cluster by running some real workflows (Resolved, 09/03/2021)

Related to Arvados - Bug #18026: [singularity] crunch-run image cache race condition (Resolved, 08/24/2021)

History

#1 Updated by Ward Vandewege 2 months ago

  • Description updated (diff)

#2 Updated by Ward Vandewege 2 months ago

  • Related to Story #17755: Test singularity support on a cloud cluster by running some real workflows added

#3 Updated by Peter Amstutz 2 months ago

  • Category set to Crunch

#4 Updated by Peter Amstutz 2 months ago

  • Assigned To set to Peter Amstutz

#5 Updated by Peter Amstutz about 2 months ago

  • Related to Bug #18026: [singularity] crunch-run image cache race condition added

#6 Updated by Peter Amstutz about 2 months ago

  • Target version changed from 2021-09-01 sprint to 2021-09-15 sprint

#7 Updated by Peter Amstutz about 2 months ago

  • Assigned To deleted (Peter Amstutz)

#8 Updated by Peter Amstutz about 2 months ago

  • Assigned To set to Tom Clegg

#9 Updated by Tom Clegg about 2 months ago

This is a tordo image / configuration bug.

I ran a "sleep 1800" container and logged in to the node to see journalctl.

arvados-dispatch-cloud uses "sudo docker ps -q" as the BootProbeCommand, but the node isn't really ready yet when docker first starts. During the boot process docker gets restarted (which is bound to cause trouble), and the encrypted filesystem is mounted over /tmp (which causes exactly the problem we see in this ticket: crunch-run writes some files in /tmp, only to have them mysteriously disappear).

Sep 02 20:52:27 ip-10-253-254-10 cloud-init[628]: + cryptsetup --key-file /var/tmp/key-WvBb7fhA.tmp luksOpen /dev/mapper/compute-tmp tmp
...
Sep 02 20:52:31 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:31.976846994Z Fetching Docker image from collection '4ad7d11381df349e464694762db14e04+303'
Sep 02 20:52:32 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:32.017126993Z Using Docker image id "sha256:e67b8c126d8f2d411d72aa04cc0ab87dace18eef152d5f0b07dd677284fc0002" 
Sep 02 20:52:32 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:32.017155204Z Loading Docker image from keep
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: docker.service: Succeeded.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: Stopped Docker Application Container Engine.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: docker.socket: Succeeded.
Sep 02 20:52:32 ip-10-253-254-10 systemd[1]: Closed Docker Socket for the API.
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + ensure_umount /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + findmntq /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + findmnt /tmp/docker/aufs
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + MOUNTOPTIONS=async
Sep 02 20:52:32 ip-10-253-254-10 cloud-init[628]: + mount -o async /dev/mapper/tmp /tmp
Sep 02 20:52:33 ip-10-253-254-10 sh[585]: Traceback (most recent call last):
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/lib/python3.7/http/client.py", line 558, in _get_chunk_left
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     chunk_left = self._read_next_chunk_size()
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/lib/python3.7/http/client.py", line 525, in _read_next_chunk_size
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     return int(line, 16)
...
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
Sep 02 20:52:33 ip-10-253-254-10 sh[585]:     raise ProtocolError("Connection broken: %r" % e, e)
Sep 02 20:52:33 ip-10-253-254-10 sh[585]: urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: arvados-docker-cleaner.service: Main process exited, code=exited, status=1/FAILURE
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: arvados-docker-cleaner.service: Failed with result 'exit-code'.
Sep 02 20:52:33 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:33.192786081Z building singularity image
Sep 02 20:52:33 ip-10-253-254-10 crunch-run[1108]: tordo-dz642-wd0nj7l9cjxoi7o 2021-09-02T20:52:33.250629814Z [singularity build /tmp/crunch-run.tordo-dz642-wd0nj7l9cjxoi7o.800734095/keep912285602/by_uuid/tordo-4zz18-gi4qh4ynirsg9vo/image.sif docker-archive:///tmp/crunch-run-sing
Sep 02 20:52:33 ip-10-253-254-10 kernel: SGI XFS with ACLs, security attributes, realtime, no debug enabled
Sep 02 20:52:33 ip-10-253-254-10 kernel: XFS (dm-1): Mounting V5 Filesystem
Sep 02 20:52:33 ip-10-253-254-10 kernel: XFS (dm-1): Ending clean mount
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + chmod a+w,+t /tmp
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + cat
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + '[' -d /etc/sv/docker.io ']'
Sep 02 20:52:33 ip-10-253-254-10 cloud-init[628]: + service docker start
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Socket for the API.
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Listening on Docker Socket for the API.
Sep 02 20:52:33 ip-10-253-254-10 systemd[1]: Starting Docker Application Container Engine...
...
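The probe behavior at issue can be sketched as follows: the dispatcher repeatedly runs BootProbeCommand on the new node until it exits zero, then considers the node ready for containers. This is a hypothetical, simplified Python loop for illustration, not the actual Go implementation in arvados-dispatch-cloud; the function name and timeout values are made up.

```python
import subprocess
import time

def wait_for_boot(probe_cmd, timeout=300, interval=1):
    """Repeatedly run probe_cmd (a shell command) until it exits 0.

    Returns True if the probe succeeded before `timeout` seconds elapsed,
    False otherwise. Sketch only: the real dispatcher runs the command
    over SSH on the cloud VM.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(probe_cmd, shell=True, capture_output=True)
        if result.returncode == 0:
            # Node considered booted; the dispatcher may now schedule work.
            return True
        time.sleep(interval)
    return False

# The trouble described above: "sudo docker ps -q" can exit 0 while
# cloud-init is still remounting /tmp and restarting docker, so a probe
# like this reports "ready" too early.
```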

#10 Updated by Tom Clegg about 2 months ago

  • Status changed from New to In Progress

#11 Updated by Tom Clegg about 2 months ago

While investigating I found a couple of other things to fix:
  • 22102137c 18027: Kill arv-mount directly if arv-mount --unmount cannot run.
    • Not sure how often this happens, but if we can't even start "arv-mount --unmount" (arv-mount command has disappeared from PATH, cannot fork, etc) it seems like we should at least send SIGKILL to our arv-mount process before giving up.
  • 9ce401977 18027: Fix crash when reporting error during --unmount.
    • Python3 error... sys.stderr.write(somebytes) crashes so we never get to see the fusermount error message. This happened in the tordo container -- presumably the error was that the mount point didn't exist, because see above.
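The crash in the second commit is a classic Python 2 to 3 issue: in Python 3, sys.stderr.write() accepts only str, so passing raw subprocess output (bytes) raises TypeError and masks the real fusermount error. A minimal illustration of the bug and the fix; the variable name is hypothetical, not arv-mount's actual code:

```python
import sys

# Example bytes as they might come back from a fusermount subprocess.
fusermount_stderr = b"fusermount: entry for /mnt not found in /etc/mtab\n"

# Python 3: writing bytes to a text stream raises TypeError -- this is
# what hid the real fusermount error message.
try:
    sys.stderr.write(fusermount_stderr)
except TypeError:
    pass

# Fix option 1: decode the bytes before writing.
sys.stderr.write(fusermount_stderr.decode("utf-8", errors="replace"))

# Fix option 2: write to the underlying binary buffer (guarded, since
# replaced stderr streams may not expose one).
if hasattr(sys.stderr, "buffer"):
    sys.stderr.buffer.write(fusermount_stderr)
```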

18027-unmount-fuse @ 22102137c1e5eda8543c79d30d963ceb9e9a47af -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2671/
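The kill-fallback from commit 22102137c can be sketched like this: try the clean path ("arv-mount --unmount"), and if the helper cannot even be started (or fails), fall back to SIGKILL on the arv-mount process we spawned. This is a hypothetical Python sketch of the logic, not crunch-run's actual Go code; the function name and timeout are illustrative.

```python
import os
import signal
import subprocess

def unmount(mount_point, arv_mount_pid):
    """Try a clean unmount; if that can't be done, SIGKILL arv-mount."""
    try:
        subprocess.run(
            ["arv-mount", "--unmount", mount_point],
            check=True, timeout=60,
        )
    except (OSError, subprocess.SubprocessError):
        # Couldn't run the helper (gone from PATH, fork failed, it errored
        # out, or it hung): don't just give up -- at least make sure our
        # arv-mount process dies.
        os.kill(arv_mount_pid, signal.SIGKILL)
```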

#12 Updated by Peter Amstutz about 1 month ago

Tom Clegg wrote:

While investigating I found a couple of other things to fix
  • 22102137c 18027: Kill arv-mount directly if arv-mount --unmount cannot run.
    • Not sure how often this happens, but if we can't even start "arv-mount --unmount" (arv-mount command has disappeared from PATH, cannot fork, etc) it seems like we should at least send SIGKILL to our arv-mount process before giving up.

Probably related to --unmount not working: I noticed that even though it was designed to avoid getting stuck on the "dead mount" tar baby, it actually still does. Maybe we should try again to fix that?

  • 9ce401977 18027: Fix crash when reporting error during --unmount.
    • Python3 error... sys.stderr.write(somebytes) crashes so we never get to see the fusermount error message. This happened in the tordo container -- presumably the error was that the mount point didn't exist, because see above.

Great, yeah, I've seen that happen a few times; it's embarrassing.

18027-unmount-fuse @ 22102137c1e5eda8543c79d30d963ceb9e9a47af -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2671/

This branch LGTM

#13 Updated by Tom Clegg about 1 month ago

  • Status changed from In Progress to Resolved

#14 Updated by Tom Clegg about 1 month ago

Changed default boot probe command to "systemctl is-system-running".

Considered adding "--wait", but
  • it isn't supported in medium-old systemd versions, and it would be super annoying to have a default boot probe that always fails.
  • --wait doesn't seem to offer a timeout option, which means there's no feedback to report if we reach the boot timeout while waiting for it. OTOH, without --wait, in the event of a stuck boot script we get a systemd state like "starting" or "degraded" included in a-d-c's "boot failed" log entry.
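For reference, the knob being changed here lives under Containers/CloudVMs in the cluster config. A hedged sketch of the relevant fragment (cluster ID and surrounding keys elided; check arvados' config.default.yml for the authoritative layout):

```yaml
Clusters:
  tordo:
    Containers:
      CloudVMs:
        # Old default: succeeds as soon as dockerd answers, which can be
        # before cloud-init has finished remounting /tmp.
        # BootProbeCommand: "sudo docker ps -q"
        # New default: succeeds only once systemd reports boot fully done.
        BootProbeCommand: "systemctl is-system-running"
```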

18027-boot-probe-command @ b169b40808a571663b695d638fb2915625624734 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2683/

#15 Updated by Tom Clegg about 1 month ago

  • Status changed from Resolved to In Progress

#16 Updated by Ward Vandewege about 1 month ago

Tom Clegg wrote:

Changed default boot probe command to "systemctl is-system-running".

Considered adding "--wait", but
  • it isn't supported in medium-old systemd versions, and it would be super annoying to have a default boot probe that always fails.
  • --wait doesn't seem to offer a timeout option, which means there's no feedback to report if we reach the boot timeout while waiting for it. OTOH, without --wait, in the event of a stuck boot script we get a systemd state like "starting" or "degraded" included in a-d-c's "boot failed" log entry.

18027-boot-probe-command @ b169b40808a571663b695d638fb2915625624734 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2683/

I'm trying to get a successful run of the developer-run-tests-apps-workbench-integration CI job at https://ci.arvados.org/job/developer-run-tests-apps-workbench-integration/2849/; it seems hard to imagine the failures are real/related, but just making sure.

Otherwise, LGTM, thanks.

#17 Updated by Tom Clegg about 1 month ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|f57467be5f17a0894b213f632af9af53b6e57bdc.
