Story #2492

Run Job tasks in a Docker container

Added by Brett Smith over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Start date: 05/05/2014
Due date:
% Done: 100%
Estimated time: (Total: 8.00 h)
Story points: 1.0

Description

Write a tool that can set up an appropriate environment to run a Job. At the latest, Crunch v2 will use this to actually run those Jobs in a more stable, predictable environment.


Subtasks

Task #2632: Test and refine crunch-job docker_image patch in the staging environment (Resolved, Brett Smith)

Task #2634: Document the docker_image runtime constraint (Resolved, Brett Smith)

Task #2734: Review 2492-docker-crunch-jobs (Resolved, Peter Amstutz)

Associated revisions

Revision 222ce386
Added by Brett Smith over 7 years ago

Merge branch '2492-docker-crunch-jobs'

Closes #2492.

History

#1 Updated by Brett Smith over 7 years ago

  • Description updated
  • Assigned To set to Brett Smith

#2 Updated by Brett Smith over 7 years ago

Interesting issue: right now we can't build one Dockerfile that accommodates arv-crunch-job. That calls arv-mount, which needs FUSE, which in turn needs /dev/fuse. Creating that device takes mknod, and Docker has only just gained the ability to run mknod inside a Dockerfile; that change has not yet reached a release.

We have a few options:

  • Use one of the hacky mknod workarounds that people have relied on to date, most likely docker run --privileged in the Makefile.
  • Rearchitect Crunch so that the mount always lives on the compute node, and then expose it to the Job container as a volume.
  • Wait for this PR to make it to release, and then rely on it.
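
To make the second option concrete, here is a minimal sketch of how a compute node might build the command line that exposes an existing mount to the Job container as a read-only volume. The helper name, the /mnt/keep and /keep paths, and the task command are hypothetical; only the arvados/jobs image name, the docker.io binary name, and Docker's -v host:container[:ro] volume syntax come from this discussion or from Docker itself.

```python
def docker_run_command(image, host_mount, container_mount, task_argv,
                       docker_bin="docker.io"):
    """Build an argv list that runs task_argv inside `image`.

    The FUSE mount stays on the compute node (host_mount) and is exposed
    to the container read-only at container_mount, so the image itself
    never needs /dev/fuse. All paths here are illustrative assumptions.
    """
    volume = "{}:{}:ro".format(host_mount, container_mount)
    return [docker_bin, "run", "-v", volume, image] + list(task_argv)


# Hypothetical usage: the mount already exists on the host at /mnt/keep.
command = docker_run_command("arvados/jobs", "/mnt/keep", "/keep",
                             ["ls", "/keep"])
print(" ".join(command))
```

Returning an argv list rather than a shell string avoids quoting problems if the caller later passes it to something like subprocess.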

#3 Updated by Tom Clegg over 7 years ago

  • Subject changed from Run Jobs in a Docker container to Run Job tasks in a Docker container

#4 Updated by Tom Clegg over 7 years ago

  • Status changed from New to In Progress

#5 Updated by Brett Smith over 7 years ago

  • Project changed from Arvados to Arvados Private
  • Status changed from In Progress to New
  • Target version deleted (2014-04-16 Dev tools and data/resource management)

The branch 2492-docker-crunch-jobs has a Dockerfile with all the SDKs installed, as well as a proposed patch to crunch-job to support a specified docker_image as a runtime constraint. I'm coordinating with Ward to test this in the staging environment. It requires a new Linux, so that's at least a little involved.

#6 Updated by Brett Smith over 7 years ago

  • Project changed from Arvados Private to Arvados
  • Status changed from New to In Progress
  • Target version set to 2014-04-16 Dev tools and data/resource management

#7 Updated by Tom Clegg over 7 years ago

  • Target version changed from 2014-04-16 Dev tools and data/resource management to 2014-05-07 Storing and Organizing Data

#8 Updated by Brett Smith over 7 years ago

  • Estimated time set to 8.00 h
  • Story points changed from 2.0 to 1.0

Updated numbers for this sprint.

#9 Updated by Peter Amstutz over 7 years ago

  1. If the user uses a symbolic name for the docker image, can we resolve that to a hash and record the hash for the job, like we do for script versions?

#10 Updated by Brett Smith over 7 years ago

Peter Amstutz wrote:

  1. If the user uses a symbolic name for the docker image, can we resolve that to a hash and record the hash for the job, like we do for script versions?

This is definitely the direction we ultimately want to take with Docker: applying the same job reuse logic to images that we apply to script versions. And the output of docker.io images --no-trunc is easy enough to parse to do the translation.
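
As a rough illustration of that translation, the sketch below parses text shaped like the output of docker.io images --no-trunc and maps a symbolic name to a full image ID. The column layout (REPOSITORY, TAG, IMAGE ID first, separated by whitespace) is an assumption about the output format, and the sample row and its ID are made up for the example; this is not the code in the branch.

```python
# Placeholder 64-character ID, invented for this example.
FAKE_ID = "0123456789abcdef" * 4

SAMPLE_OUTPUT = (
    "REPOSITORY      TAG       IMAGE ID      CREATED        VIRTUAL SIZE\n"
    "arvados/jobs    latest    {0}    2 weeks ago    1.2 GB\n"
).format(FAKE_ID)


def parse_docker_images(output):
    """Return a dict mapping 'repository:tag' to the full image ID."""
    image_ids = {}
    lines = output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete image row
        repository, tag, image_id = fields[:3]
        image_ids["{}:{}".format(repository, tag)] = image_id
    return image_ids


def resolve_image(name, image_ids):
    """Resolve a symbolic name to an image ID, or None if unknown.

    Simplification: a name without ':' is treated as ':latest'; registry
    hosts with port numbers are not handled.
    """
    if ":" not in name:
        name += ":latest"
    return image_ids.get(name)


print(resolve_image("arvados/jobs", parse_docker_images(SAMPLE_OUTPUT)))
```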

Unfortunately, we have a logistical snafu in that the necessary information is far away from the API server. Docker is only installed on the compute nodes, which the API server can only interact with via SLURM. And as far as I can tell, the only way to find out from the command line if a new image is available from the repository is to try to pull it. All this means that it wouldn't be too difficult to record the information, but then trying to use that hash to figure out job reuse would be unreasonably expensive: you'd have to wait for a compute node to become available, run docker.io pull on it, wait for a potentially lengthy install to finish, and then see if the image hash changed.
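
The expensive check described above boils down to "look up the ID, pull, look it up again". A minimal sketch follows, with the Docker interaction injected as plain callables so the logic can be read without a compute node. The function and parameter names are hypothetical, and the fake lookup and pull below stand in for real docker.io invocations.

```python
def image_changed_after_pull(name, get_image_id, pull):
    """Pull `name` and report whether its local image ID changed.

    get_image_id(name) returns the current local ID (or None);
    pull(name) performs the potentially slow pull. Both are supplied by
    the caller; nothing here talks to Docker directly.
    """
    before = get_image_id(name)
    pull(name)
    after = get_image_id(name)
    return before != after, after


# Stand-ins for a compute node's local image store (made-up IDs).
local_images = {"arvados/jobs": "old-id"}


def fake_get_image_id(name):
    return local_images.get(name)


def fake_pull(name):
    local_images[name] = "new-id"  # pretend the repository had an update


changed, current_id = image_changed_after_pull(
    "arvados/jobs", fake_get_image_id, fake_pull)
print(changed, current_id)
```

In a real deployment every call would first have to wait for a compute node and then for the pull itself, which is the cost that makes this impractical for job reuse decisions.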

Our story around Docker image management still needs to be hammered out. We've talked generally about storing those images in Keep, and then identifying them by their Collection hash, which would go much further to enable the kind of smarts you're anticipating. I fully expect we'll do that, and I think it's a story for a future sprint. This story is first about building the arvados/jobs image, and second about providing a consistent environment for Jobs. Containerizing Jobs in crunch-job is something we knew we wanted, and it lets us prove that arvados/jobs works to spec. The docker pull logic in it is more of a stopgap, the quickest way to get the desired image on all the compute nodes. This means support for provenance is admittedly not fully baked yet, and I think solving that means more consideration about how Docker images live in Arvados as a whole. Any halfway effort to address it now will probably get replaced when that happens.

tl;dr: It's a great idea, but I'm unsure now is the right time.

(Writing this up made me realize we have a bug in that specifying an image hash is fine for docker.io run but not docker.io pull. I'll have to figure out a bugfix for that.)

#11 Updated by Brett Smith over 7 years ago

Brett Smith wrote:

(Writing this up made me realize we have a bug in that specifying an image hash is fine for docker.io run but not docker.io pull. I'll have to figure out a bugfix for that.)

Did the simplest possible thing in 2e31424. It's ready for another look.

#12 Updated by Brett Smith over 7 years ago

  • Status changed from In Progress to Resolved

Applied in changeset 222ce386e36b3d146e718a5d2f64a95fb30996bb (arvados repository).
