Build docker images as part of a workflow



Container images provide a well-defined execution environment for doing reproducible work. As long as the image is runnable by a container engine, a job can be repeated. However, the point of reproducibility isn't just to allow repetition of the same computation -- it's to make it possible to use prior work as the starting point for future work. Much of this opportunity is lost if the provenance trail ends at a binary image.

Ideally, when a bug is discovered in an analysis tool or library, it should be easy to identify which existing results are affected, and re-run those analyses with the updated software.

Users should have the option of building container images
  • as part of a CWL workflow (so they can update the image-building instructions and hit one "re-run" button to see the result)
  • in Arvados containers (so the build environment is controlled, build logs are saved, etc.)
  • without having docker on the client side (so build-and-run workflows can be initiated from browsers, non-Linux workstations, and shared VM environments)

However, Arvados currently (2022) relies on workstations and shell nodes to build docker images (or download them from external sources) and upload them to Keep before starting a containerized workflow.


1. Migrate docker links to collection properties
  • arv-keepdocker should set collection properties["docker-image-repo-tag"] when adding (already done in #16046, #17508)
  • arv-keepdocker should set collection properties["docker-image-hash"] and properties["docker-image-timestamp"]
  • arv-keepdocker should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"]
  • arvados-cwl-runner should search collections with properties["docker-image-repo-tag"] instead of "docker_image_repo+tag" links, and sort by properties["docker-image-timestamp"]
  • RailsAPI "resolve docker image spec to container" code should search collection properties for given repo:tag or hash, instead of searching links, and sort by properties["docker-image-timestamp"]
  • RailsAPI data migration should copy any pre-existing "docker_image_repo+tag", hash, and timestamp values from links into the corresponding collection properties
2. Support "pull image" container request, #19860
  • As a special case, accept docker_image="arvados/builtin" to mean "builtin command"
  • Builtin command ["docker", "pull", "repo:tag"] causes crunch-run to run docker pull and, instead of running a container, save the resulting image (sha256:*.tar) as the output collection
  • mounts hash is expected/required to be empty
  • runtime_constraints.API is expected/required to be true
  • output_path is expected/required to be "/"
  • crunch-run sets output_properties {"docker-image-hash":"...", "docker-image-repo-tag":"repo:tag"}
3. arvados-cwl-runner submits a "pull image" container request when needed
  • i.e., if the requested image is not already available in Keep, and docker is not installed/usable directly (e.g., running in an arvados container)
4. Support "build image" container request
  • Another builtin command: ["docker", "build"]
  • url uses docker syntax to indicate a collection or remote git repo containing Dockerfile
  • environment can be used to pass build args
  • mounts establishes build context (e.g., mount a collection or git tree at "/")
  • If Dockerfile is not at the root of build context, use ["docker", "build", "/path/to/Dockerfile"]
  • output_path is expected/required to be "/"
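
The "pull image" request from step 2 can be sketched as a plain Python dictionary, plus a check of the listed constraints. This is a minimal sketch rather than an existing Arvados API: make_pull_request, check_pull_request and expected_output_properties are hypothetical helper names, and the field names simply follow the list above (the attribute names in the real API may differ).

```python
def make_pull_request(repo_tag, extra_constraints=None):
    """Build a hypothetical "pull image" container request body."""
    constraints = {"API": True}  # step 2: runtime_constraints.API must be true
    constraints.update(extra_constraints or {})
    return {
        "docker_image": "arvados/builtin",        # special case: builtin command
        "command": ["docker", "pull", repo_tag],  # builtin command to run
        "mounts": {},                             # expected to be empty
        "runtime_constraints": constraints,
        "output_path": "/",
    }


def check_pull_request(req):
    """Return a list of problems; an empty list means the request fits step 2."""
    problems = []
    if req.get("docker_image") != "arvados/builtin":
        problems.append('docker_image must be "arvados/builtin"')
    command = req.get("command") or []
    if len(command) != 3 or command[:2] != ["docker", "pull"]:
        problems.append('command must be ["docker", "pull", "repo:tag"]')
    if req.get("mounts"):
        problems.append("mounts must be empty")
    if req.get("runtime_constraints", {}).get("API") is not True:
        problems.append("runtime_constraints.API must be true")
    if req.get("output_path") != "/":
        problems.append('output_path must be "/"')
    return problems


def expected_output_properties(req, image_hash):
    """Properties crunch-run would set on the output collection (per step 2)."""
    return {
        "docker-image-hash": image_hash,
        "docker-image-repo-tag": req["command"][2],
    }
```

For example, check_pull_request(make_pull_request("debian:12")) returns an empty list, while adding any entry to mounts yields the single problem "mounts must be empty". Whether extra_constraints (RAM, VCPUs) should come from the client or from site configuration is one of the questions that remains open.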


How do we avoid the situation where a user copies and modifies an image collection but unwittingly leaves the properties in place, causing the modified collection to be used unintentionally?
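
To make the concern concrete, here is a minimal sketch of a lookup that relies only on collection properties, as proposed in step 1. The helper names, the uuid values, and the in-memory list of collections are hypothetical stand-ins for an API query. The sketch also assumes that timestamps are ISO 8601 UTC strings, so that sorting them as strings sorts them by time.

```python
def image_candidates(collections, repo_tag):
    """Return every collection whose properties claim to be repo_tag."""
    return [
        c for c in collections
        if c.get("properties", {}).get("docker-image-repo-tag") == repo_tag
    ]


def pick_image(collections, repo_tag):
    """Pick the candidate with the latest docker-image-timestamp, or None."""
    candidates = image_candidates(collections, repo_tag)
    if not candidates:
        return None
    # Assumption: ISO 8601 UTC strings, so string order equals time order.
    return max(
        candidates,
        key=lambda c: c["properties"].get("docker-image-timestamp", ""),
    )


# Hypothetical data: a copy keeps the original's properties untouched.
props = {
    "docker-image-repo-tag": "foo/bar:latest",
    "docker-image-hash": "sha256:aaaa",  # placeholder value
    "docker-image-timestamp": "2022-01-01T00:00:00Z",
}
original = {"uuid": "original", "properties": dict(props)}
modified_copy = {"uuid": "modified-copy", "properties": dict(props)}

candidates = image_candidates([original, modified_copy], "foo/bar:latest")
```

Both collections end up in candidates, and because their timestamps are equal, pick_image returns whichever one happens to be listed first. Nothing in the properties distinguishes the modified copy from the original.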

For a docker pull request, should runtime_constraints be automatic (site configurable), or should the client specify? (Consider the case of pulling a 2 GiB image from dockerhub.)

In a docker build request, if Dockerfile says FROM foo/bar and there is already an image in Arvados tagged foo/bar, should that image be used as the build base, or should docker pull foo/bar from dockerhub and use that as the base?

Updated by Tom Clegg about 1 year ago · 7 revisions