Build docker images as part of a workflow
(draft)
Background
Container images provide a well-defined execution environment for doing reproducible work. As long as the image is runnable by a container engine, a job can be repeated. However, the point of reproducibility isn't just to allow repetition of the same computation -- it's to make it possible to use prior work as the starting point for future work. Much of this opportunity is lost if the provenance trail ends at a binary image.
Ideally, when a bug is discovered in an analysis tool or library, it should be easy to identify which existing results are affected, and re-run those analyses with the updated software.
Users should have the option of building container images:
- ...as part of a CWL workflow (so they can update the image-building instructions and hit one "re-run" button to see the result)
- ...in Arvados containers (so the build environment is controlled, build logs are saved, etc.)
- ...without having docker on the client side (so build-and-run workflows can be initiated from browsers, non-Linux workstations, and shared VM environments)
However, Arvados currently (2022) relies on workstations and shell nodes to build docker images (or download them from external sources) and upload them to Keep before starting a containerized workflow.
Implementation
1. Migrate docker links to collection properties
- arv-keepdocker should set collection properties["docker-image-repo-tag"] when adding (already done in #16046, #17508)
- arv-keepdocker should set collection properties["docker-image-hash"] and properties["docker-image-timestamp"]
- arv-keepdocker should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"]
- arvados-cwl-runner should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"]
- RailsAPI "resolve docker image spec to container" code should search collection properties for given repo:tag or hash, instead of searching links, and sort by properties["docker-image-timestamp"]
- RailsAPI data migration should copy any pre-existing "docker_image_repo+tag" and "docker_image_hash" link values, along with their timestamps, into the corresponding collection properties
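The migration and lookup described in step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the RailsAPI or arv-keepdocker code: the function names are hypothetical, each link is assumed to be a dict carrying link_class, name, and an image_timestamp property, the repo:tag property is assumed to hold a single string, and timestamps are assumed to be ISO 8601 UTC strings in one consistent format.

```python
from __future__ import annotations

from typing import Optional

REPO_TAG_KEY = "docker-image-repo-tag"
HASH_KEY = "docker-image-hash"
TIMESTAMP_KEY = "docker-image-timestamp"


def properties_from_links(links: list[dict]) -> dict:
    """Merge the docker links that point at one collection into properties.

    Hypothetical helper. Assumes each link looks like
    {"link_class": ..., "name": ..., "properties": {"image_timestamp": ...}}.
    """
    props: dict = {}
    for link in links:
        if link["link_class"] == "docker_image_repo+tag":
            props[REPO_TAG_KEY] = link["name"]
        elif link["link_class"] == "docker_image_hash":
            props[HASH_KEY] = link["name"]
        else:
            continue  # unrelated link classes are ignored
        timestamp = link.get("properties", {}).get("image_timestamp")
        # Keep the latest timestamp seen across the docker links.
        if timestamp and timestamp > props.get(TIMESTAMP_KEY, ""):
            props[TIMESTAMP_KEY] = timestamp
    return props


def newest_image(collections: list[dict], repo_tag: str) -> Optional[dict]:
    """Return the collection tagged repo_tag with the latest timestamp."""
    matches = [
        c for c in collections
        if c.get("properties", {}).get(REPO_TAG_KEY) == repo_tag
    ]
    # Assumption: same-format ISO 8601 UTC strings sort chronologically.
    return max(
        matches,
        key=lambda c: c["properties"].get(TIMESTAMP_KEY, ""),
        default=None,
    )
```

Sorting on the timestamp property keeps the "most recent image wins" behaviour in one place, so arv-keepdocker, arvados-cwl-runner, and RailsAPI could all apply the same rule.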
- Accept as a special case docker_image="arvados/builtin" to mean "builtin command"
- Builtin command ["docker", "pull", "repo:tag"] causes crunch-run to run docker pull and save the resulting image sha256:*.tar as the output collection instead of running a container
  - mounts hash is expected/required to be empty
  - runtime_constraints.API is expected/required to be true
  - output_path is expected/required to be "/"
  - crunch-run sets output_properties {"docker-image-hash":"...", "docker-image-repo-tag":"repo:tag"}
- i.e., if the requested image is not already available in Keep, and docker is not installed/usable directly (e.g., running in an arvados container)
- Another builtin command: ["docker", "build"]
  - url uses docker syntax to indicate a collection or remote git repo containing Dockerfile
  - environment can be used to pass build args
  - mounts establishes build context (e.g., mount a collection or git tree at "/")
    - If Dockerfile is not at the root of build context, use ["docker", "build", "/path/to/Dockerfile"]
  - output_path is expected/required to be "/"
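As a rough illustration of the builtin commands described above, the following sketch builds and checks the request bodies a client might submit. It is not an implemented API. The helper names are hypothetical, and using the container_image field to carry "arvados/builtin" is an assumption, since the draft writes it as docker_image. Other runtime constraints are left out because they are still an open question.

```python
from __future__ import annotations

from typing import Optional

BUILTIN_IMAGE = "arvados/builtin"


def docker_pull_request(repo_tag: str) -> dict:
    """Build a request for the proposed builtin docker pull (hypothetical)."""
    return {
        "container_image": BUILTIN_IMAGE,  # assumed field name
        "command": ["docker", "pull", repo_tag],
        "mounts": {},                          # expected to be empty
        "runtime_constraints": {"API": True},  # expected to be true
        "output_path": "/",
    }


def docker_build_request(context_mount: dict, build_args: dict,
                         dockerfile: Optional[str] = None) -> dict:
    """Build a request for the proposed builtin docker build (hypothetical).

    context_mount is passed through unchanged as the mount at "/".
    """
    command = ["docker", "build"]
    if dockerfile is not None:
        # Only needed when Dockerfile is not at the build context root.
        command.append(dockerfile)
    return {
        "container_image": BUILTIN_IMAGE,
        "command": command,
        "environment": dict(build_args),  # build args travel as environment
        "mounts": {"/": context_mount},   # build context
        "output_path": "/",
    }


def validate_builtin_request(req: dict) -> list[str]:
    """Return the draft rules that req violates; empty means it fits."""
    problems = []
    command = req.get("command", [])
    if req.get("container_image") != BUILTIN_IMAGE:
        problems.append("image must be arvados/builtin")
    if command[:2] == ["docker", "pull"]:
        if len(command) != 3:
            problems.append("docker pull takes exactly one repo:tag")
        if req.get("mounts"):
            problems.append("mounts must be empty for docker pull")
        if req.get("runtime_constraints", {}).get("API") is not True:
            problems.append("runtime_constraints.API must be true")
    elif command[:2] != ["docker", "build"]:
        problems.append("unknown builtin command")
    if req.get("output_path") != "/":
        problems.append('output_path must be "/"')
    return problems
```

A check like validate_builtin_request could run wherever the request is accepted, so that a malformed builtin request fails early instead of being treated as an ordinary container.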
TBD
How do we avoid the situation of copying & modifying an image collection, and unwittingly leaving the properties in place, causing the modified collection to be used unintentionally?
For a docker pull request, should runtime_constraints be automatic (site configurable), or should the client specify? (Consider the case of pulling a 2 GiB image from dockerhub.)
In a docker build request, if Dockerfile says FROM foo/bar and there is already an image in Arvados tagged foo/bar, should that image be used as the build base, or should docker pull foo/bar from dockerhub and use that as the base?
Updated by Tom Clegg over 1 year ago · 7 revisions