Story #5623

[Crunch] Specify Crunch2 features and APIs

Added by Tom Clegg over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
04/07/2015
Due date:
% Done:

100%

Estimated time:
(Total: 4.00 h)
Story points:
1.0

Description

broad constraints
  • support CWL
  • support other workflow runners like Queue
  • multiple jobs can share nodes
  • reusable subtasks (or whatever they're called)
  • easy debugging
  • no perl
  • loosely coupled with api server, no requirement for crunch-dispatcher to live on the same machine as the api server or require direct sql access

See Jobs API wiki page.


Subtasks

Task #5628: Meeting about crunch strategy - Resolved - Peter Amstutz

Task #5711: Requirements / API for Crunch v2 - Resolved - Tom Clegg

Task #6104: Review - Resolved - Peter Amstutz


Related issues

Blocks Arvados - Story #4685: [Crunch] CWL prototype workflow runner in Arvados - Resolved - 04/07/2015

History

#1 Updated by Peter Amstutz over 4 years ago

  • Assigned To set to Peter Amstutz

#2 Updated by Peter Amstutz over 4 years ago

In order to completely support CWL workflows out of the box, Crunch will need to provide the following features:

  1. Ability to run Docker containers without an Arvados SDK (or any other assumptions about the software within). This entails:
    1. The ability to specify an arbitrary command line that runs inside the Docker image
    2. The ability to specify environment variables
    3. The ability to specify configuration files that are made available inside the container before the command runs.
    4. The ability to optionally redirect stdin (from a file in Keep) and stdout (to a file in the output directory)
    5. Have an output directory bind-mounted into the container, which is either uploaded on process completion, or is a writable FUSE directory.
  2. The workflow engine needs to have a way to support pluggable expression engines
    1. The workflow engine could execute the expression engine directly via Docker-in-Docker, but that requires a --privileged container which we may want to avoid.
    2. We could dispatch expressions as jobs, but at a significant performance penalty since expressions are intended for very small computations like "x+1"
    3. We could implement a trusted broker which runs containers to evaluate expressions on the workflow's behalf, but without the overhead of a full job dispatch.
  3. CWL supports pulling images from a Docker registry, loading images from tarballs, and/or creating images from Dockerfiles. The workflow engine either needs to be able to perform these tasks itself (with Docker-in-Docker), or we need a Docker service that can pull/load/build images and upload them into Arvados on the workflow engine's behalf.
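
As a rough illustration of feature group 1, here is a minimal sketch of how a dispatcher might turn a command line, environment variables, and an output directory into one `docker run` invocation. The helper name `build_docker_argv`, the image name, and the paths are hypothetical and not part of Crunch; only the standard `docker run` flags `--rm`, `-i`, `-e`, and `-v` are used.

```python
import shlex


def build_docker_argv(image, command, env=None, output_dir=None,
                      container_output_dir="/output", use_stdin=False):
    """Assemble a `docker run` argument vector for one job (hypothetical helper)."""
    argv = ["docker", "run", "--rm"]
    if use_stdin:
        # Keep stdin open so the caller can feed a file from Keep to the process.
        argv.append("-i")
    for name, value in sorted((env or {}).items()):
        argv += ["-e", f"{name}={value}"]
    if output_dir is not None:
        # Bind-mount a host directory as the job's output directory.
        argv += ["-v", f"{output_dir}:{container_output_dir}"]
    argv.append(image)
    # The command is passed as separate arguments; no shell is involved.
    argv += list(command)
    return argv


argv = build_docker_argv(
    image="example/image:latest",          # assumed image name
    command=["sh", "-c", "echo hello"],
    env={"FOO": "bar"},
    output_dir="/tmp/job-output",          # assumed host path
)
print(shlex.join(argv))
```

Redirecting stdin and stdout would then be a matter of the caller opening the source and destination files and passing them as the `stdin` and `stdout` arguments of `subprocess.run(argv, ...)`, so the redirect never goes through a shell.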

#3 Updated by Tom Clegg over 4 years ago

Peter Amstutz wrote:

In order to completely support CWL workflows out of the box, Crunch will need to provide the following features:

Let's try to back off a bit from "completely" where possible, for the sake of "initial implementation".

  1. Ability to run Docker containers without an Arvados SDK (or any other assumptions about the software within). This entails:
    1. The ability to specify an arbitrary command line that runs inside the Docker image

This seems like it should be easy: e.g., if script isn't a file in /crunch_scripts in the given repo, just use it as the docker CMD. I suppose we'll want to parse it such that "script":"foo; 'bar baz'" results in invoking "docker run foo\; bar\ baz".

I assume it's also important to support a null command line ("just do whatever CMD said in the Dockerfile")?
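
One way the parsing could work, sketched in Python with the standard `shlex` module. The function name `script_to_docker_argv` and the explicit `image` argument are assumptions for illustration; the check for a file in /crunch_scripts is left out.

```python
import shlex


def script_to_docker_argv(script, image):
    """Turn a job's "script" string into a docker argv (hypothetical helper)."""
    # An empty or missing script means "run whatever CMD the image defines".
    command = shlex.split(script) if script else []
    return ["docker", "run", image] + command


# The quoted words stay together and the ";" is passed through literally,
# because the arguments never go through a shell.
argv = script_to_docker_argv("foo; 'bar baz'", "example/image")
print(argv)

# Null command line: only the image is given.
print(script_to_docker_argv(None, "example/image"))
```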

  1. The ability to specify environment variables

I assume you mean the kind "docker run -e foo=bar" would accomplish. Technically this could be done by putting "-e foo=bar" at the front of the "script" attribute, but that would be ugly. Perhaps better to have a special "ENV" parameter, so {"script_parameters":{"ENV":{"foo":"bar"}}} ...?
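
A small sketch of the special "ENV" parameter idea, assuming the dispatcher strips "ENV" out of script_parameters and turns it into `-e` flags. The function name `split_env` and the extra "input" key are hypothetical.

```python
def split_env(script_parameters):
    """Separate the special "ENV" entry from the remaining script parameters."""
    params = dict(script_parameters)   # copy so the caller's record is untouched
    env = params.pop("ENV", {})
    flags = []
    for name in sorted(env):
        flags += ["-e", f"{name}={env[name]}"]
    return flags, params


flags, remaining = split_env({"ENV": {"foo": "bar"}, "input": "abc"})
print(flags, remaining)
```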

  1. The ability to specify configuration files that are made available inside the container before the command runs.

Can you give an example? It seems like the read-only Keep mount already makes this easy. But I'm not sure what you mean by "before the command runs". (The container lasts exactly as long as the command, doesn't it?)

  1. The ability to optionally redirect stdin (from a file in Keep)

I suppose this could also be done in "script":"..." with a shell redirect, but again that would be clunky. Perhaps something like "script_parameters":{"stdin":"/file/in/container/probably/under/keepmount"} would work for this too? It seems better to keep that stdin source inside the container, although the docker invocation for that seems less obvious. (How do you redirect stdin but still run the CMD given in the Dockerfile?) Perhaps we have to settle for really carefully verifying the stdin argument is in the Keep mount?
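
A sketch of what "really carefully verifying" might look like, assuming the Keep mount lives at /keep inside the container (the mount point and the function name `stdin_is_under_keep` are assumptions). This is a purely lexical check; it does not resolve symlinks, so a real implementation would need more than this.

```python
import posixpath


def stdin_is_under_keep(stdin_path, keep_mount="/keep"):
    """Return True only if stdin_path is an absolute path strictly below keep_mount."""
    if not posixpath.isabs(stdin_path):
        return False
    # Collapse "." and ".." components before comparing.
    normalized = posixpath.normpath(stdin_path)
    mount = posixpath.normpath(keep_mount)
    if normalized == mount:
        return False  # the mount point itself is a directory, not a file
    return posixpath.commonpath([normalized, mount]) == mount


for candidate in ["/keep/abc/input.txt", "/keep/../etc/passwd", "/keepsake/file"]:
    print(candidate, stdin_is_under_keep(candidate))
```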

and stdout (to a file in the output directory)

[How] is "multiple files output by a job" expected to work?

  1. Have an output directory bind-mounted into the container, which is either uploaded on process completion, or is a writable FUSE directory.

The specifics here (API and implementation) will require some thought...

  1. The workflow engine needs to have a way to support pluggable expression engines

It looks like this was just proposed last week -- I assume/hope that means most stuff doesn't rely on it, and we can kick it off our "initial implementation" list?

  1. CWL supports pulling images from a Docker registry, loading images from tarballs, and/or creating images from Dockerfiles. The workflow engine either needs to be able to perform these tasks itself (with Docker-in-Docker), or we need a Docker service that can pull/load/build images and upload them into Arvados on the workflow engine's behalf.

How common is it for a CWL spec to build a new image?

For an "initial implementation" can we support just a subset of these capabilities, like "you can specify a docker image tag, for now it only works if you've arv-keep-dockered it ahead of time" ?

#4 Updated by Peter Amstutz over 4 years ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg over 4 years ago

  • Subject changed from [CWL] Specify expectations for initial implementation of workflow runner to [Crunch] Specify Crunch2 features and APIs needed by CWL workflow runner
  • Category set to Crunch

#6 Updated by Peter Amstutz over 4 years ago

Here's my CWL wishlist / proposed clean sheet redesign of job request record for crunch v2:

  • uuid, owner_uuid, modified_by_client_uuid, modified_by_user_uuid, created_at, modified_at
    • Standard fields
  • name, description
    • User-friendly information about the job
  • state, started_at, finished_at, log
    • Same as current job
  • created_by_job_uuid
    • The job that spawned this job, or null if it is a root job initiated by a user.
  • input_object
    • JSON containing the input object (functionally the same as script_parameters)
  • output_object
    • JSON containing the output object (jobs are no longer required to write to Keep, could also have several fields for multiple output collections.) Changing the basic output type from a collection to a JSON object is important for native CWL support.
  • pure
    • Whether this job can be reused (the inverse of the existing "nondeterministic" flag, ref #3555)
  • git_repository
  • git_commit
  • resolved_git_commit
    • Basically same as before, except that the user supplies "git_commit" and the API server fills in "resolved_git_commit" to the full SHA1 hash instead of rewriting the user-supplied field.
  • docker_image
  • resolved_docker_image
    • Similar to git, the user supplies docker_image and the API server resolves that to resolved_docker_image. Also this ought to be the Docker image hash, not the collection PDH.
  • git_checkout_dir
  • temp_dir
  • output_dir
  • keep_dir
    • Desired paths inside the docker container where git checkout, temporary directory, output directories and keep mount should go.
  • stdin
    • A file in Keep that should be sent to standard input.
  • stdout
    • A filename in the output directory to which standard output should be directed.
  • environment
    • A JSON object with environment variables and values that should be set in the container environment (docker run --env)
  • initial_collection
    • A collection describing the starting contents of the output directory.
  • command
    • A JSON array of strings containing the parameters to the actual executable command line.
  • progress
    • A decimal between 0.0 and 1.0 describing the fraction of work done.
  • runtime_debugging
    • Enable debug logging for the infrastructure (such as arv-mount) (this might get logged privately away from the end user).
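
To make the proposal concrete, here is a hedged sketch of a subset of the record as a Python dataclass. The field names follow the list above; the class name `JobRequest`, the default values, the example paths, and the `validate` checks are assumptions, not an agreed API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class JobRequest:
    """Subset of the proposed job request record (illustrative only)."""
    command: list                      # JSON array of strings
    docker_image: str                  # user-supplied; the server would resolve it
    input_object: dict = field(default_factory=dict)
    environment: dict = field(default_factory=dict)
    output_dir: str = "/output"        # assumed default path in the container
    keep_dir: str = "/keep"            # assumed default path in the container
    stdin: Optional[str] = None        # file in Keep sent to standard input
    stdout: Optional[str] = None       # filename in the output directory
    pure: bool = False
    progress: float = 0.0

    def validate(self):
        if not all(isinstance(arg, str) for arg in self.command):
            raise ValueError("command must be an array of strings")
        if not 0.0 <= self.progress <= 1.0:
            raise ValueError("progress must be between 0.0 and 1.0")


request = JobRequest(
    command=["wc", "-l"],
    docker_image="example/image",
    stdin="/keep/example/input.txt",   # hypothetical path under the Keep mount
    stdout="count.txt",
    environment={"LC_ALL": "C"},
)
request.validate()
record = json.loads(json.dumps(asdict(request)))
print(json.dumps(record, indent=2))
```

Keeping `command` as an array of strings, rather than a single string, avoids the quoting questions raised earlier in the thread.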

#7 Updated by Tom Clegg over 4 years ago

  • Subject changed from [Crunch] Specify Crunch2 features and APIs needed by CWL workflow runner to [Crunch] Specify Crunch2 features and APIs
  • Description updated
  • Target version changed from 2015-04-29 sprint to 2015-05-20 sprint

#8 Updated by Tom Clegg over 4 years ago

  • Description updated

#9 Updated by Brett Smith over 4 years ago

  • Target version changed from 2015-05-20 sprint to 2015-06-10 sprint

Assuming no unforeseen wrinkles come up, I would really like to see this done by the end of the 2015-06-10 sprint so we can start writing code in subsequent sprints. It's clear that, as a matter of saving time, this is blocking Crunch v2.

#10 Updated by Peter Amstutz over 4 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg

#11 Updated by Ward Vandewege over 4 years ago

  • Description updated

#12 Updated by Tom Clegg over 4 years ago

  • Status changed from In Progress to Resolved
