Idea #17296
closedSingularity proof of concept
Description
Given the following restrictions:
- singularity 3.5.2 installed on the compute node.
- config.yml has "ContainerExec=Singularity" "SingularityBinary=/usr/bin/singularity"
- no privileges, setuid or special linux capabilities are needed to run the container (i.e. ping won't work because it needs CAP_NET_RAW)
- The space required in the compute node is 2 * Docker Image size uncompressed. Since the singularity build step requires a file on disk for the docker-archive:// schema
- When we have a ContainerRequest,
- /usr/bin/singularity build .... to create the .sif file
- /usr/bin/singularity run ... to run the container using --bind for all the container mounts
- date in realtime the container StdErr and StdOut
- store the outputs in a collection
Before starting the support for singularity we need to define an interface for all Container Executors.
type ContainerExecuter interface { // CheckImageIsLoaded checks if imageID is already in the local environment, // either something we can reference later in Docker API or similar, or a // filepath containing the image to be used CheckImageIsLoaded(imageID ImageID) bool // LoadImage translates a io.Reader that has a tarball in docker format // (Created with 'docker save'), and load it to the local // ContainerExecuter. // // Returns an ImageID to be referenced later, note that this could be an image // identifier or a filepath or something else LoadImage(containerImage io.Reader) (ImageID, error) // ImageRemove removes the image loaded using LoadImage() ImageRemove(imageID ImageID) error ContainerState() (ContainerState, error) // CreateContainer will prepare anything in the creation on the container // env is an array of string "KEY=VALUE" that represents the environment variables // containerConfig has all the parameters needed to start the container CreateContainer(containerConfig ContainerConfig) error // StartContainer will start the container StartContainer() error // Kill the container and optionally remove the underlying image returns an // error if it didn't work (including timeout) Kill() error // this is how https://golang.org/pkg/os/exec/#Cmd does it. StdinPipe() (io.WriteCloser, error) StdoutPipe() (io.ReadCloser, error) StderrPipe() (io.ReadCloser, error) // Wait for the container to finish Wait() error // Returns the exit code of the last executed container ExitCode() (int, error) }
When we run a container on a compute node, we do a container conversion, on the fly, to a SIF file, and run that with singularity instead.
- global option that switches between docker or singularity runner
- container_request runtime parameters flag that chooses between docker and singularity
- crunch-run gets docker tar file from keep (existing docker v2 format images)
- crunch-run converts docker tar file to SIF:
$ docker save arvados/jobs:latest > arvados-jobs.latest.tar $ ls -laF arvados-jobs.latest.tar -rw-r--r-- 1 ward ward 295209984 Jan 14 17:16 arvados-jobs.latest.tar $ singularity build arvados-jobs.latest.sif docker-archive://arvados-jobs.latest.tar INFO: Starting build... ...
- crunch-run executes singularity with mount points, stdout/stderr captured to logs
- slurm dispatcher supports singularity
- ideally the backend container runner should be transparent to the dispatcher
- proof of concept will be tested on 9tee4
- assume that user id inside the container will be the same as the crunch-run user (?)
- try to support running containers without setuid, identify specific features that require setuid on singularity binary.
Testing goals / acceptance criteria
- MVP: runs a container
- default value for singularity binary (/usr/bin/singularity) but can be changed from arvados config.yml
- captures stdout/stderr to logs
- can bind-mount arv-mount inside the container
- can bind mount tmp/output directories inside the container
- output files have proper permissions to be read for upload & cleaned up (deleted) by crunch-run
- see if it makes sense to have singularity mock the docker API
- should have similar test coverage of singularity features as exist to the Docker features
For future tickets:
- crunchstat
- memory / CPU constraints
- Save the SIF file in Keep and do something with another Link object to make it findable in the future, for the corresponding docker image. TODO: check if the framework we built in for the docker image format v1 -> v2 could be used here.
Related issues
Updated by Peter Amstutz almost 4 years ago
- Description updated (diff)
- Subject changed from Singularity MVP to Singularity proof of concept
Updated by Ward Vandewege almost 4 years ago
- Related to Idea #17241: Scoping/grooming Singularity support work added
Updated by Peter Amstutz almost 4 years ago
- Story points set to 5.0
- Description updated (diff)
Updated by Peter Amstutz almost 4 years ago
- Target version changed from 2021-02-17 sprint to 2021-03-03 sprint
Updated by Peter Amstutz almost 4 years ago
- Target version changed from 2021-03-03 sprint to 2021-03-17 sprint
Updated by Peter Amstutz over 3 years ago
- Assigned To changed from Tom Clegg to Nico César
Updated by Peter Amstutz over 3 years ago
- Related to Idea #16305: Singularity support added
Updated by Nico César over 3 years ago
very first take trying to create an abstraction that matches 1 to 1 with docker for now,
c1d1f0502b8a0f049dba41da2f6b19a0d4b03d77
TODO:
- [DONE] 0a27815bd review ContainerConfig, HostConfig settings and add them to the ThinContainerExecRunner interface as Get/Set methods to abstract from the internal representation
- review all the networking related options and see if can be simplified
- make a run with crunch-run --container-runner singularity to see how it behaves
- add tests related to singularity
Updated by Nico César over 3 years ago
Commit:0a27815bd 17241-singularity-take1
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-03-17 sprint to 2021-03-31 sprint
Updated by Nico César over 3 years ago
As I'm reading all the documentation available about singularity I want to write down some notes:
It is also important to note that the philosophy of Singularity is Integration over Isolation. Most container run times strive to isolate your container from the host system and other containers as much as possible. Singularity, on the other hand, assumes that the user’s primary goals are portability, reproducibility, and ease of use and that isolation is often a tertiary concern.
Therefore, Singularity only isolates the mount namespace by default, and will bind mount several host directories such as $HOME and /tmp into the container at runtime. If needed, additional levels of isolation can be achieved by passing options causing Singularity to enter any or all of the other kernel namespaces and to prevent automatic bind mounting. These measures allow users to interact with the host system from within the container in sensible ways.
(taken from https://sylabs.io/guides/3.7/user-guide/security.html )
I see a potential problem here, since singularity tries to incorporate the HOST files as part of the container in a transparent way, this could cause problems if crunch-run is running everything with the same user and maybe in a shared environment
Updated by Nico César over 3 years ago
fccb8a3de 17241-singularity-take1
Tom let's review this branch before I continue down the wrong path
Updated by Nico César over 3 years ago
- Target version changed from 2021-03-31 sprint to 2021-04-14 sprint
Updated by Nico César over 3 years ago
43b39915bae3f3c24ab31cfbc7aefdce88f84dcb branch 17296-singularity-take2
we had a conversation with tom and we agree that lib/crunchrun/container_exec.go looks good to start making lib/crunchrun/container_exec_test.go to add all the tests for a contarner exec.
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-04-14 sprint to 2021-05-12 sprint
Updated by Javier Bértoli over 3 years ago
Added singularity v3 to compute{0,1}.9tee4, removed it from api.9tee4 (commit ecfae2b..4a694a1@saltstack)
Updated by Tom Clegg over 3 years ago
- Assigned To changed from Nico César to Tom Clegg
Updated by Tom Clegg over 3 years ago
- Target version changed from 2021-05-12 sprint to 2021-05-26 sprint
Updated by Tom Clegg over 3 years ago
17296-singularity @ 8fd331405028ebdbb97de58560057564aa530105 -- developer-run-tests: #2477
I think this is approximately functional enough to merge as alpha/experimental. More work is definitely needed to make it a fully supported feature.
Limitations- Container gateway won't work with singularity. (Todo: make sure it at least fails gracefully, and mention in docs.)
- Singularity doesn't even attempt to pass through VCPUs/memory. (I tried passing VCPUs but it just made the container fail to start.)
- Singularity "build" step (convert docker image) fails if squashfs-tools is not installed. (Should we try to report this more gracefully?)
- No cache for the docker-to-singularity image conversion.
- crunchstat.txt tracks host stats instead of container stats. (Haven't looked into how to do cgroups with singularity.)
- We now read the docker image from arv-mount instead of a ManifestFileReader. Might give us slightly better startup latency due to readahead?
- Rewrote a lot of the testing code. Tests are still rather copy-pastey and could use more cleanup but I felt like I needed to stop short of rewriting Everything.
- Config key "Containers.RuntimeEngine" make sense? Should it be Containers.CloudVMs.RuntimeEngine instead?
- I noticed our "load docker image" code assumes the sha256 in the tarball filename matches the ID of the image and probably behaves strangely if it doesn't. (I didn't try to fix this.)
- Very exciting, we finally have an integration test for crunch-run that uses the real apiserver, arv-mount, and docker/singularity.
Updated by Tom Clegg over 3 years ago
17296-singularity @ 5a902f04bc1efdd30398195cb50df85ebab75d84 -- developer-run-tests: #2478
addresses a test failure (/var/lib/arvados/tmp isn't necessarily writable by the non-root user that runs tests).
Updated by Ward Vandewege over 3 years ago
Tom Clegg wrote:
17296-singularity @ 5a902f04bc1efdd30398195cb50df85ebab75d84 -- developer-run-tests: #2478
addresses a test failure (/var/lib/arvados/tmp isn't necessarily writable by the non-root user that runs tests).
- slight behavior change in loading docker images: before, the .tar file of the image had to be the first file in the collection, and that was loaded. Any subsequent .tar files were ignored. Now, it is an error if there is more than one .tar file. This change should probably be mentioned in the upgrade docs (though it's unlikely someone would run into it).
- since this branch does not add any documentation yet, I've created a story for Singularity documentation (#17726) to the epic and added a note there about the previous bullet.
- I think the config key "Containers.RuntimeEngine" makes sense; the plan is for singularity support to be available in non-cloud dispatchers as well, right?
- In RunCommand() (lib/crunchrun/crunchrun.go), line 1695 and later, would it make sense to print a warning for the first two conditions that cause the gateway service not to be started?
- Looks like the new integration test is failing because it can't find the docker binary. That is then followed by a panic in TearDownSuite (eek). I see you test for the presence of Docker in crunchrun/docker_test.go, but not yet in crunchrun/integration_test.go (developer-run-tests-remainder: #2592 /console):
14:24:59 ---------------------------------------------------------------------- 14:24:59 FAIL: integration_test.go:38: integrationSuite.SetUpSuite 14:24:59 14:24:59 14:24:59 integration_test.go:43: 14:24:59 c.Assert(err, IsNil) 14:24:59 ... value *exec.Error = &exec.Error{Name:"docker", Err:(*errors.errorString)(0xc0000a83c0)} ("exec: \"docker\": executable file not found in $PATH") 14:24:59 14:24:59 14:24:59 ---------------------------------------------------------------------- 14:24:59 PANIC: integration_test.go:76: integrationSuite.TearDownSuite 14:24:59 14:24:59 ... Panic: runtime error: invalid memory address or nil pointer dereference (PC=0x43A7D8) 14:24:59 14:24:59 /usr/src/go/src/runtime/panic.go:965 14:24:59 in gopanic 14:24:59 /usr/src/go/src/runtime/panic.go:212 14:24:59 in panicmem 14:24:59 /usr/src/go/src/runtime/signal_unix.go:734 14:24:59 in sigpanic 14:24:59 /tmp/workspace/developer-run-tests-remainder/sdk/go/arvados/client.go:321 14:24:59 in Client.RequestAndDecodeContext 14:24:59 /tmp/workspace/developer-run-tests-remainder/sdk/go/arvados/client.go:312 14:24:59 in Client.RequestAndDecode 14:24:59 integration_test.go:77 14:24:59 in integrationSuite.TearDownSuite 14:24:59 /usr/src/go/src/reflect/value.go:337 14:24:59 in Value.Call 14:24:59 /usr/src/go/src/runtime/asm_amd64.s:1371 14:24:59 in goexit
Other than that, LGTM.
Updated by Tom Clegg over 3 years ago
- added note about docker image collections (will need to adjust before merging so it goes in the forthcoming "upgrading from 2.2" section instead of 2.1)
- added logs when not starting gateway service
- added docker check so tests should pass on jenkins
Updated by Tom Clegg over 3 years ago
- Target version changed from 2021-05-26 sprint to 2021-06-09 sprint
Updated by Ward Vandewege over 3 years ago
Tom Clegg wrote:
17296-singularity @ fb5cab780281c6cfb834cbfae64f3fcac7918d19 -- developer-run-tests: #2495
- added note about docker image collections (will need to adjust before merging so it goes in the forthcoming "upgrading from 2.2" section instead of 2.1)
- added logs when not starting gateway service
- added docker check so tests should pass on jenkins
LGTM, thanks!
Updated by Tom Clegg over 3 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|bcab6c50f54f155f14d1dd635becd85a5e91289d.
Updated by Ward Vandewege over 3 years ago
- Related to Idea #17755: Test singularity support on a cloud cluster by running some real workflows added