Feature #17813
closedCache docker-to-singularity image conversions
Description
Currently, when crunch-run is in singularity mode, it converts the specified docker image to a .sif file before starting the container. The conversion is slow, and the .sif file is not saved anywhere, so the entire process is slow even when the same image is used for lots of steps in a workflow, sequential containers running on the same worker node, etc.
We can speed this up in most cases by saving the .sif file in Keep.- Before converting the docker image, check a project called "auto-generated singularity images" in a ".cache" project in the home project (runtime_user_uuid) for a collection named "singularity image for X" (where X is the docker image PDH) containing a single .sif file. If that exists, extend the trash_at time to now+2w, and use that .sif file instead of converting the docker image.
- If that doesn't work out, convert the docker image as we do now.
- After converting, save the .sif file in "singularity image for X" with trash_at=now+2w and properties {"type":"singularity image","docker_image_pdh":X}.
Updated by Tom Clegg over 3 years ago
- Related to Idea #16305: Singularity support added
Updated by Peter Amstutz over 3 years ago
- Target version deleted (
Arvados Future Sprints)
Updated by Peter Amstutz over 3 years ago
- Target version set to 2021-07-21 sprint
Updated by Tom Clegg over 3 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Peter Amstutz over 3 years ago
- Assigned To changed from Tom Clegg to Peter Amstutz
Updated by Peter Amstutz over 3 years ago
17813-docker-to-singularity @ cc476f769c06ac6150a4f6d406eb2656f41e1dc3
- Adds image caching
- Loads the sif image from the keep mount, which makes starting up very fast
- Fixed bug in setting $HOME
- Passes all the CWL tests except one, which is about handling docker ENTRYPOINT, which get skipped by cwltool too
- Fixed API server bug with default value of runtime_auth_scopes
todo: figure out how to write a test for this feature
todo: set TTL on the cache image
Updated by Peter Amstutz over 3 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-07-21 sprint to 2021-08-04 sprint
Updated by Tom Clegg over 3 years ago
Some initial comments on work in progress:
Rather than adding the optional SetArvadoClient() func and accommodating the "if e.containerClient == nil { return }" cases, wouldn't it be simpler overall (and easier to test) if we pass the arvados.Client etc. to LoadImage() and have LoadImage call funcs like e.sifCacheFilename(), e.convertDockerImage(src, dst)? It seems like the only real reason ImageLoaded exists as a separate func (instead of just printing that "already loaded" / "loading" log message from LoadImage()) is that that's how the Docker tests were written.
Rather than staging the SIF file on local scratch space and then copying it to Keep, could we create a cache collection with a temp-timestamped name, tell singularity to write the .sif file directly to keep/by_id/$x/image.sif, then rename the cache collection into place if successful? (I see the description specifically says to convert and then copy, but the only advantage I can think of is that it still works even if we can't create a temp collection, and I'm not sure that's a case we should bother catering to.)
Updated by Peter Amstutz over 3 years ago
17813-docker-to-singularity @ e5199566d2bdfe207d8d06ba0f555cca801fd31b
- Refactored to use keep mount for both reading and writing
- Updates TTL on the collection
Should still probably add a singularity specific unit test that checks the cache behavior, but it's been confirmed to work with manual testing of running workflows and checking the contents of the .cache project.
Updated by Tom Clegg over 3 years ago
Suggest setting trash_at when first creating the sif cache collection -- that way if something goes wrong before/when we rename it into place, the (presumably useless) temp collection won't stick around forever. I think this is what the "two containers convert the same docker image and only one wins the race to rename into place" case will look like.
On that note, maybe worth changing the "error updating collection trash_at" log message, so it's more obvious that it's referring to the cached sif file and won't affect the container outcome?
I'm guessing this is a stray debug printf?
exec.Command("find", arvMountPoint+"/by_id/").Run()
Rest LGTM, thanks. As discussed in meeting I think it's fine to merge without testing the cache behavior per se.
Updated by Peter Amstutz over 3 years ago
Tom Clegg wrote:
Suggest setting trash_at when first creating the sif cache collection -- that way if something goes wrong before/when we rename it into place, the (presumably useless) temp collection won't stick around forever. I think this is what the "two containers convert the same docker image and only one wins the race to rename into place" case will look like.
On that note, maybe worth changing the "error updating collection trash_at" log message, so it's more obvious that it's referring to the cached sif file and won't affect the container outcome?
Yes. I also realized that it didn't handle the conflict case successfully (the response would not include the portable data hash), so that should be fixed now.
I'm guessing this is a stray debug printf?
Yes, I was getting "not found" errors and wasn't sure what was going on with the mount, but it turned out to be because the "by_id" directory was pdh-only, which is why I had to add a "by_uuid" mount to make this work.
17813-docker-to-singularity @ 72371e98271a56aca2723fcc54b861db69299bba
Updated by Tom Clegg over 3 years ago
Looks like if "create new cache collection" returns an error, we log the error but then return a Collection struct with an empty UUID and nil error, and LoadImage doesn't look like it handles that well. Should be return nil, fmt.Errorf(...) there instead of e.logf(...)?
Updated by Tom Clegg over 3 years ago
- Status changed from In Progress to Resolved