Feature #17813

Cache docker-to-singularity image conversions

Added by Tom Clegg 3 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
07/23/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Currently, when crunch-run is in singularity mode, it converts the specified docker image to a .sif file before starting the container. The conversion is slow, and the .sif file is not saved anywhere, so the entire process is slow even when the same image is used for lots of steps in a workflow, sequential containers running on the same worker node, etc.

We can speed this up in most cases by saving the .sif file in Keep.
  • Before converting the docker image, check a project called "auto-generated singularity images" in a ".cache" project in the home project (runtime_user_uuid) for a collection named "singularity image for X" (where X is the docker image PDH) containing a single .sif file. If that exists, extend the trash_at time to now+2w, and use that .sif file instead of converting the docker image.
  • If that doesn't work out, convert the docker image as we do now.
  • After converting, save the .sif file in "singularity image for X" with trash_at=now+2w and properties {"type":"singularity image","docker_image_pdh":X}.

Subtasks

Task #17867: Review 17813-docker-to-singularityResolvedTom Clegg

Task #17921: implementResolvedPeter Amstutz


Related issues

Related to Arvados Epics - Story #16305: Singularity supportIn Progress01/01/202109/30/2021

Associated revisions

Revision ce454513
Added by Tom Clegg 2 months ago

Merge branch '17813-docker-to-singularity' into main

refs #17813

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 3 months ago

#2 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#3 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#4 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#5 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#6 Updated by Peter Amstutz 3 months ago

  • Target version deleted (Arvados Future Sprints)

#7 Updated by Peter Amstutz 3 months ago

  • Target version set to 2021-07-21 sprint

#8 Updated by Peter Amstutz 3 months ago

  • Assigned To set to Peter Amstutz

#9 Updated by Tom Clegg 3 months ago

  • Assigned To changed from Peter Amstutz to Tom Clegg

#10 Updated by Peter Amstutz 2 months ago

  • Assigned To changed from Tom Clegg to Peter Amstutz

#11 Updated by Peter Amstutz 2 months ago

17813-docker-to-singularity @ cc476f769c06ac6150a4f6d406eb2656f41e1dc3

  • Adds image caching
  • Loads the sif image from the keep mount, which makes starting up very fast
  • Fixed bug in setting $HOME
  • Passes all the CWL tests except one, which is about handling docker ENTRYPOINT, which get skipped by cwltool too
  • Fixed API server bug with default value of runtime_auth_scopes

todo: figure out how to write a test for this feature
todo: set TTL on the cache image

https://ci.arvados.org/view/Developer/job/developer-run-tests/2599/

#12 Updated by Peter Amstutz 2 months ago

  • Status changed from New to In Progress

#13 Updated by Peter Amstutz 2 months ago

  • Target version changed from 2021-07-21 sprint to 2021-08-04 sprint

#14 Updated by Tom Clegg 2 months ago

Some initial comments on work in progress:

Rather than adding the optional SetArvadoClient() func and accommodating the "if e.containerClient == nil { return }" cases, wouldn't it be simpler overall (and easier to test) if we pass the arvados.Client etc. to LoadImage() and have LoadImage call funcs like e.sifCacheFilename(), e.convertDockerImage(src, dst)? It seems like the only real reason ImageLoaded exists as a separate func (instead of just printing that "already loaded" / "loading" log message from LoadImage()) is that that's how the Docker tests were written.

Rather than staging the SIF file on local scratch space and then copying it to Keep, could we create a cache collection with a temp-timestamped name, tell singularity to write the .sif file directly to keep/by_id/$x/image.sif, then rename the cache collection into place if successful? (I see the description specifically says to convert and then copy, but the only advantage I can think of is that it still works even if we can't create a temp collection, and I'm not sure that's a case we should bother catering to.)

#15 Updated by Peter Amstutz 2 months ago

17813-docker-to-singularity @ e5199566d2bdfe207d8d06ba0f555cca801fd31b

  • Refactored to use keep mount for both reading and writing
  • Updates TTL on the collection

Should still probably add a singularity specific unit test that checks the cache behavior, but it's been confirmed to work with manual testing of running workflows and checking the contents of the .cache project.

https://ci.arvados.org/view/Developer/job/developer-run-tests/2605/

#16 Updated by Tom Clegg 2 months ago

Suggest setting trash_at when first creating the sif cache collection -- that way if something goes wrong before/when we rename it into place, the (presumably useless) temp collection won't stick around forever. I think this is what the "two containers convert the same docker image and only one wins the race to rename into place" case will look like.

On that note, maybe worth changing the "error updating collection trash_at" log message, so it's more obvious that it's referring to the cached sif file and won't affect the container outcome?

I'm guessing this is a stray debug printf?

               exec.Command("find", arvMountPoint+"/by_id/").Run()

Rest LGTM, thanks. As discussed in meeting I think it's fine to merge without testing the cache behavior per se.

#17 Updated by Peter Amstutz 2 months ago

Tom Clegg wrote:

Suggest setting trash_at when first creating the sif cache collection -- that way if something goes wrong before/when we rename it into place, the (presumably useless) temp collection won't stick around forever. I think this is what the "two containers convert the same docker image and only one wins the race to rename into place" case will look like.

On that note, maybe worth changing the "error updating collection trash_at" log message, so it's more obvious that it's referring to the cached sif file and won't affect the container outcome?

Yes. I also realized that it didn't handle the conflict case successfully (the response would not include the portable data hash), so that should be fixed now.

I'm guessing this is a stray debug printf?

Yes, I was getting "not found" errors and wasn't sure what was going on with the mount, but it turned out to be because the "by_id" directory was pdh-only, which is why I had to add a "by_uuid" mount to make this work.

17813-docker-to-singularity @ 72371e98271a56aca2723fcc54b861db69299bba

https://ci.arvados.org/view/Developer/job/developer-run-tests/2607/

#18 Updated by Tom Clegg 2 months ago

Looks like if "create new cache collection" returns an error, we log the error but then return a Collection struct with an empty UUID and nil error, and LoadImage doesn't look like it handles that well. Should be return nil, fmt.Errorf(...) there instead of e.logf(...)?

#19 Updated by Tom Clegg about 2 months ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF