Containers API (DRAFT)

See also Container dispatch

A Container resource is a record of a computational process.
  • Its goal is to capture, unambiguously, as much information as possible about the environment in which the process was run. For example, git trees, data collections, and docker images are stored as content addresses. This makes it possible to reason about the difference between two processes, and to replay a process at a different time and place.
  • Clients can read Container records, but only the system can create or modify them.

Note about the term "containers" vs. "jobs" and "services": Here, we focus on the use of containers as producers of output data. We anticipate extending the feature set to cover service containers as well. The distinguishing feature of a service container is that having it running is inherently valuable because of the way it interacts with the outside world.

A ContainerRequest is a client's expression of interest in knowing the outcome of a computational process.
  • Typically, in this context the client's description of the process is less precise than a Container: a ContainerRequest describes container constraints which can have different interpretations over time. For example, a ContainerRequest with a {"kind":"git_tree","commit_range":"abc123..master",...} mount might be satisfiable by any of several different source trees, and this set of satisfying source trees can change when the repository's "master" branch is updated.
  • The system is responsible for finding suitable Containers and assigning them to ContainerRequests. (Currently this is expected to be done synchronously during the containerRequests.create and containerRequests.update transactions.)
  • A ContainerRequest may indicate that it can only be satisfied by a new Container record (i.e., existing results should not be reused). In this case creating a ContainerRequest amounts to a submission to the container queue. This is appropriate when the purpose of the ContainerRequest is to test whether a process is repeatable.
  • A ContainerRequest may indicate that it cannot be satisfied by a new Container record. This is an appropriate way to test whether a result is already available.

When the system has assigned a Container to a ContainerRequest, anyone with permission to read the ContainerRequest also has permission to read the Container.

Use cases

Preview

Tell me how you would satisfy container request X. Which pdh/commits would be used? Is the satisfying container already started? finished?

Submit a previewed existing container

I'm happy with the already-running/finished container you showed me in "preview". Give me access to that container, its logs, and [when it finishes] its output.

Submit a previewed new container

I'm happy with the new container the "preview" response proposed to run. Run that container.

Submit a new container (disable reuse)

I don't want to use an already-running/finished container. Run a new container that satisfies my container request.

Submit a new duplicate container (disable reuse)

I'm happy with the already-running/finished container you showed me in "preview". Run a new container exactly like that one.

Select a container and associate it with my ContainerRequest

I'm not happy with the container you chose, but I know of another container that satisfies my request. Assuming I'm right about that, attach my ContainerRequest to the existing container of my choice.

Just do the right thing without a preview

Satisfy container request X one way or another, and tell me the resulting container's UUID.

ContainerRequest/Container life cycle

Illustrating container re-use and preview facility:
  1. Client ClientA creates a ContainerRequest CRA with priority=0.
  2. Server creates container CX and assigns CX to CRA, but does not try to run CX yet because max(priority)=0.
  3. Client ClientA presents CX to the user. "We haven't computed this result yet, so we'll have to run a new container. Is this OK?"
  4. Client ClientB creates a ContainerRequest CRB with priority=1.
  5. Server assigns CX to CRB and puts CX in the execution queue with priority=1.
  6. Client ClientA updates CRA with priority=2.
  7. Server updates CX with priority=2.
  8. Container CX starts.
  9. Client ClientA updates CRA with priority=0. (This is as close as we get to a "cancel" operation.)
  10. Server updates CX with priority=1. (CRB still wants this container to complete.)
  11. Container CX finishes.
  12. Clients ClientA and ClientB have permission to read CX (ever since CX was assigned to their respective ContainerRequests) as well as its progress indicators, output, and log.
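
The sequence above is consistent with the server deriving a container's priority as the maximum priority among the ContainerRequests assigned to it. A minimal Python sketch under that assumption (the function name and the in-memory dictionary are illustrative, not part of the API):

```python
from typing import Dict, List


def container_priority(request_priorities: Dict[str, int]) -> int:
    """Return the highest priority among assigned requests (0 if none)."""
    return max(request_priorities.values(), default=0)


# Replay the priority changes that affect container CX in the steps above.
assigned: Dict[str, int] = {}
history: List[int] = []

assigned["CRA"] = 0                            # step 1
history.append(container_priority(assigned))   # step 2: CX is not queued yet
assigned["CRB"] = 1                            # step 4
history.append(container_priority(assigned))   # step 5
assigned["CRA"] = 2                            # step 6
history.append(container_priority(assigned))   # step 7
assigned["CRA"] = 0                            # step 9
history.append(container_priority(assigned))   # step 10: CRB still wants CX
```

After the replay, history holds the container priority seen at steps 2, 5, 7, and 10.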

"ContainerRequest" schema

Attribute Type Description Discussion Examples
uuid, owner_uuid, modified_by_client_uuid, modified_by_user_uuid string Usual Arvados model attributes
created_at, modified_at datetime Usual Arvados model attributes
name string Unparsed
description text Unparsed
properties object Client-defined structured data that does not affect how the container is run.
state string Once a request is committed, the only attributes that can be modified are priority, container_uuid, and container_count_max. A request with state="Final" cannot have any of its functional parts modified (i.e., only name, description, and properties fields can be modified). "Uncommitted"
"Committed"
"Final"
requesting_container_uuid string When the referenced container ends, the container request is automatically cancelled. Can be null. If changed to a non-null value, it must refer to a container that is running.
container_uuid uuid The container that satisfies this container request. See "methods" below.
container_count_max positive integer Maximum number of containers to start ("attempts"). See "methods" below.
mounts hash Objects to attach to the container's filesystem and stdin/stdout.
Keys starting with a forward slash indicate objects mounted in the container's filesystem.
Other keys are given special meanings here.

We use "stdin" instead of "/dev/stdin" because literally replacing /dev/stdin with a file would have a confusing effect on many unix programs. The stdin feature only affects the standard input of the first process started in the container; after that, the usual rules apply.

{
 "/input/foo":{
  "kind":"collection",
  "portable_data_hash":"d41d8cd98f00b204e9800998ecf8427e+0" 
 },
 "stdin":{
  "kind":"collection",
  "uuid":"zzzzz-4zz18-yyyyyyyyyyyyyyy",
  "path":"/foo.txt" 
 },
 "stdout":{
  "kind":"file",
  "path":"/tmp/a.out" 
 }
}
runtime_constraints hash Restrict the container's access to compute resources and the outside world (in addition to its explicitly stated inputs and output).
-- Each key is the name of a capability, like "internet" or "API" or "clock". The corresponding value is true (the capability must be available in the container's runtime environment), false (it must not be available), a single value, or an array of two numbers indicating an inclusive range. Numeric values are given in basic units (e.g., RAM is given in bytes, not KB or MB or MiB). If a key is omitted, availability of the corresponding capability is acceptable but not necessary.

This is a generalized version of "enforce purity restrictions": it is not a claim that the container will be pure. Rather, it helps us control and track runtime restrictions, which can be helpful when reasoning about whether a given container was pure.

{
  "ram":12000000000,
  "vcpus":2,
  "keep_cache_ram":256000000,
  "API":true
}
scheduling_parameters hash Parameters to pass to the container scheduler (e.g., SLURM) when running the container.
{
  "partitions":["fastcpu","vfastcpu"]
}
container_image string Docker image repository and tag, docker image hash, collection UUID, or collection PDH.
environment hash Environment variables and values that should be set in the container environment (docker run --env). This augments and (when conflicts exist) overrides environment variables given in the image's Dockerfile.
cwd string initial working directory, given as an absolute path (in the container) or a path relative to the WORKDIR given in the image's Dockerfile. The default is ".".
"/tmp"
command array of strings Command to execute in the container. Default is the CMD given in the image's Dockerfile.
To use a UNIX pipeline, like "echo foo | tr f b", or to interpolate environment variables, make sure your container image has a shell, and use a command like ["sh","-c","echo $PATH | wc"].
output_path string Path to a directory or file inside the container that should be preserved as the container's output when it finishes. This path must be, or be inside, one of the mount targets.
For best performance, point output_path to a writable collection mount.
priority integer 0≤N≤1000 Higher number means spend more resources (e.g., go ahead of other queued containers, bring up more nodes).
-- Zero means a container should not be run on behalf of this request. (Clients are expected to submit ContainerRequests with zero priority in order to preview the container that will be used to satisfy them.)

Priority is ignored when state!="Committed".

null
0
10
1000
expires_at datetime After this time, priority is considered to be zero. If the assigned container is running at that time, the container may be cancelled to conserve resources.
null
2015-07-01T00:00:01Z
use_existing boolean If possible, use an existing (non-failed) container to satisfy the request instead of creating a new one. Default is true.
true
false
filters array Additional constraints for satisfying the request, given in the same form as the filters parameter accepted by the containers.list API.
["created_at","<","2015-07-01T00:00:01Z"]
output_name string Name of the output collection that will be created when the container finishes. If null, a unique name will be assigned automatically.
null
"my container output"
output_ttl non-negative integer Desired lifetime of the output collection, in seconds. This is implemented by setting trash_at and delete_at attributes on the output collection. If zero, trash_at and delete_at will be null and the output collection will not be deleted automatically.
0
86400
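
The interaction between state, priority, and expires_at described in the schema above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and treating an ignored or null priority as zero is an assumption rather than something the schema states.

```python
from datetime import datetime, timezone
from typing import Optional


def effective_priority(state: str, priority: Optional[int],
                       expires_at: Optional[datetime], now: datetime) -> int:
    """Priority the scheduler would act on for one ContainerRequest."""
    if state != "Committed" or priority is None:
        return 0  # assumption: an ignored priority behaves like zero
    if expires_at is not None and now > expires_at:
        return 0  # after expires_at, priority is considered to be zero
    return priority


expiry = datetime(2015, 7, 1, 0, 0, 1, tzinfo=timezone.utc)
before = datetime(2015, 6, 30, tzinfo=timezone.utc)
after = datetime(2015, 7, 2, tzinfo=timezone.utc)

print(effective_priority("Committed", 10, expiry, before))
print(effective_priority("Committed", 10, expiry, after))
```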

"Container" schema

Attribute Type Description Discussion Examples
uuid, owner_uuid, created_at, modified_at, modified_by_client_uuid, modified_by_user_uuid string Usual Arvados model attributes
state string See "Container states" below
"Queued"
"Locked"
"Running"
"Cancelled"
"Failed"
"Complete"
locked_by_uuid string UUID of a token, indicating which dispatch process changed state to Locked. If null, any token can be used to lock. If not null, only the indicated token can modify the container.
Is null if and only if state∉{"Locked","Running"}
auth_uuid string UUID of a token to be passed into the container itself, used to access Keep-backed mounts, etc. Is null if and only if state∉{"Locked","Running"}
started_at, finished_at, log Same as Job attributes in Crunch1
environment hash Must be equal to a ContainerRequest's environment in order to satisfy the ContainerRequest. (TC)We could offer a "resolve" process here like we do with mounts: e.g., hash values in the ContainerRequest environment could be resolved according to the given "kind". I propose we leave room for this feature but don't add it yet.
cwd, command, output_path string Must be equal to the corresponding values in a ContainerRequest in order to satisfy that ContainerRequest.
mounts hash Must contain the same keys as the ContainerRequest being satisfied. Each value must be within the range of values described in the ContainerRequest at the time the Container is assigned to the ContainerRequest.
runtime_constraints hash Compute resources, and access to the outside world, that are/were available to the container.
-- Generally this will contain additional keys that are not present in any corresponding ContainerRequests: for example, even if no ContainerRequests specified constraints on the number of CPU cores, the number of cores actually used will be recorded here.

Permission/access types will change over time and it may be hard/impossible to translate old types to new. Such cases may cause old Containers to be ineligible for assignment to new ContainerRequests.
-- (TC)Is it permissible for this to gain keys over time? For example, a container scheduler might not be able to predict how many CPU cores will be available until the container starts.
scheduling_parameters hash See Container Request schema above.
output string Portable data hash of the output collection.
exit_code integer Process exit code. Is null if and only if state!="Complete"
null
0
1
129
pure boolean The container's output is thought to be dependent solely on its inputs, i.e., it is expected to produce identical output if repeated.
We want a feature along these lines, but "pure" seems to be a conclusion we can come to after examining various facts -- rather than a property of an individual container execution event -- and it probably needs something more subtle than a boolean.
container_image string Portable data hash of a collection containing the docker image used to run the container. (TC) If docker image hashes can be verified efficiently, we can use the native docker image hash here instead of a collection PDH.
progress number A number between 0.0 and 1.0 describing the fraction of work done.
If a container submits containers of its own, it should update its own progress as the child containers progress/finish.
priority number Priority assigned by the system, taking into account the priorities of all associated ContainerRequests.
runtime_status hash Details of the contained process's progress/outcome. Can be updated by the container or the system while state=="Running". If an "error" key exists, the container will not qualify for reuse even if it is still running.
{
  "activity": "flushing logs",
  "error": "error in foo: bar not found" 
}
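
The equality rules in the schema above (environment, cwd, command, output_path, and the set of mount keys) suggest a simple pre-check before a Container is assigned to a ContainerRequest. The Python sketch below is hypothetical and deliberately incomplete: it covers necessary conditions only, and does not compare mount values, runtime_constraints, or filters.

```python
from typing import Any, Dict

# Attributes the schema says must match the ContainerRequest exactly.
EQUAL_FIELDS = ("environment", "cwd", "command", "output_path")


def may_satisfy(container: Dict[str, Any], request: Dict[str, Any]) -> bool:
    """Check necessary (not sufficient) conditions for an assignment."""
    for field in EQUAL_FIELDS:
        if container.get(field) != request.get(field):
            return False
    # Mount values are not compared here, only the set of mount targets.
    return set(container.get("mounts", {})) == set(request.get("mounts", {}))


request = {
    "environment": {},
    "cwd": "/tmp",
    "command": ["sh", "-c", "echo $PATH | wc"],
    "output_path": "/tmp",
    "mounts": {"/tmp": {"kind": "tmp", "capacity": 1000000000}},
}
same = dict(request)
different_cwd = dict(request, cwd=".")
```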

Mount types

The "mounts" hash is the primary mechanism for adding data to the container at runtime (beyond what is already in the container image).

Each value of the "mounts" hash is itself a hash, whose "kind" key determines the handler used to attach data to the container.

Mount type kind Expected keys Description Examples
Arvados data collection collection
"portable_data_hash", "uuid", or both may be provided in a container request.
If both are provided, the uuid is considered advisory, and the container uses the provided portable_data_hash.
If only the uuid is provided, the container uses the portable data hash corresponding to the given uuid at the time the container is assigned to the container request.
If neither is provided, a new collection is created when the container runs. This is useful when "writable":true and the container's output_path is (or is a subdirectory of) this mount target.
"writable" may be provided with a true or false to indicate the path must (or must not) be writable. If not specified, the system can choose.
"path" may be provided, and defaults to "/".

At container startup, the target path will have the same directory structure as the given path within the collection. Even if the files/directories are writable in the container, modifications will not be saved back to the original collections when the container ends.

{
 "kind":"collection",
 "uuid":"...",
 "path":"/foo.txt" 
}

{
 "kind":"collection",
 "uuid":"..." 
}
Git tree git_tree
One of {"git_url", "repository_name", "uuid"} must be provided.
One of {"commit", "revisions"} must be provided.
"path" may be provided. The default path is "/".

At container startup, the target path will have the source tree indicated by the given revision. The .git metadata directory will not be available: typically the system will use git-archive rather than git-checkout to prepare the target directory.
-- If a value is given for "revisions", it will be resolved to a set of commits (as described in the "Specifying Ranges" section of gitrevisions(7)) and the container request will be satisfiable by any commit in that set.
-- If a value is given for "commit", it will be resolved to a single commit, and the tree resulting from that commit will be used.
-- "path" can be used to select a subdirectory or a single file from the tree indicated by the selected commit.
-- Multiple commits can resolve to the same tree: for example, the file/directory given in "path" might not have changed between commits A and B.
-- The resolved mount (found in the Container record) will have only the "kind" key and a "blob" or "tree" key indicating the 40-character hash of the git tree/blob used.

{
 "kind":"git_tree",
 "uuid":"zzzzz-s0uqq-xxxxxxxxxxxxxxx",
 "commit":"master" 
}

{
 "kind":"git_tree",
 "uuid":"zzzzz-s0uqq-xxxxxxxxxxxxxxx",
 "revisions":"bugfix^..master",
 "path":"/crunch_scripts/grep" 
}
Temporary directory tmp
"capacity": capacity (in bytes) of the storage device.
"device_type" (optional, default "network"): one of {"ram", "ssd", "disk", "network"} indicating the acceptable level of performance.

At container startup, the target path will be empty. When the container finishes, the content will be discarded. This will be backed by a storage mechanism no slower than the specified type.

{
 "kind":"tmp",
 "capacity":100000000000
}

{
 "kind":"tmp",
 "capacity":1000000000,
 "device_type":"ram" 
}
Keep keep
Expose all readable collections via arv-mount.
Requires suitable runtime constraints.
{
 "kind":"keep" 
}
Mounted file or directory file
"path": absolute path (inside the container) of a file or directory that is (or is inside) another mount target.
Can be used for "stdin" and "stdout" targets.
{
 "kind":"file",
 "path":"/mounted_tmp/a.out" 
}
JSON document json
A JSON-encoded string, array, or object.

{
 "kind":"json",
 "content":{"foo":"bar"}
}
Text file text
Arbitrary UTF-8 text.
Not suitable for binary data.
{
 "kind":"text",
 "content":"Foo bar.\n" 
}
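
As an illustration of the "collection" rules in the table above, the following Python sketch picks the portable data hash a mount resolves to. The function name and the lookup_pdh callable are hypothetical; a real implementation would query the API server at the time the container is assigned to the request.

```python
from typing import Callable, Dict, Optional


def resolve_collection_pdh(mount: Dict[str, object],
                           lookup_pdh: Callable[[str], str]) -> Optional[str]:
    """Pick the portable data hash a "collection" mount resolves to.

    lookup_pdh is a hypothetical callable mapping a collection UUID to its
    current portable data hash. None means a new collection is created
    when the container runs.
    """
    if mount.get("kind") != "collection":
        raise ValueError("not a collection mount")
    pdh = mount.get("portable_data_hash")
    uuid = mount.get("uuid")
    if pdh is not None:
        return str(pdh)               # a uuid given alongside is advisory
    if uuid is not None:
        return lookup_pdh(str(uuid))  # resolved at assignment time
    return None


EMPTY_PDH = "d41d8cd98f00b204e9800998ecf8427e+0"
fake_lookup = {"zzzzz-4zz18-yyyyyyyyyyyyyyy": EMPTY_PDH}.__getitem__
```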

Container states

state | significance | allowed next
Queued | Waiting for a dispatcher to lock it and try to run the container. | Locked, Cancelled
Locked | A dispatcher has "taken" the container and is allocating resources for it. The container has not started yet. | Queued, Running, Cancelled
Running | Resources have been allocated and the contained process has been started (or is about to start). Crunch-run must set state to Running before there is any possibility that user code will run in the container. | Complete, Cancelled
Complete | Container was running, and the contained process/command has exited. | -
Cancelled | The container did not run long enough to produce an exit code. This includes cases where the container didn't even start, cases where the container was interrupted/killed before it exited by itself (e.g., priority changed to 0), and cases where some problem prevented the system from capturing the contained process's exit status (exit code and output). | -
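
The "allowed next" column above can be encoded directly as a lookup table. A minimal Python sketch (names are illustrative); the "Failed" value that appears among the schema examples has no row in this table, so it is left out here:

```python
from typing import Dict, FrozenSet

# Transcription of the "allowed next" column in the table above.
ALLOWED_NEXT: Dict[str, FrozenSet[str]] = {
    "Queued": frozenset({"Locked", "Cancelled"}),
    "Locked": frozenset({"Queued", "Running", "Cancelled"}),
    "Running": frozenset({"Complete", "Cancelled"}),
    "Complete": frozenset(),   # terminal
    "Cancelled": frozenset(),  # terminal
}


def can_transition(current: str, new: str) -> bool:
    """Return True if a container may move from current to new."""
    return new in ALLOWED_NEXT.get(current, frozenset())
```

For example, can_transition("Locked", "Queued") is true (a dispatcher can release a container it has locked), while can_transition("Complete", "Running") is false.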

Permissions

Users own ContainerRequests but the system owns Containers. Users get permission to read Containers by virtue of linked ContainerRequests.

API methods

Changes from the usual REST APIs:

arvados.v1.container_requests.create and .update

These methods can fail when objects referenced in the "mounts" hash do not exist, or the acting user has insufficient permission on them.

These methods accept an optional boolean "satisfy" parameter. If true, and the create/update operation is successful, a "satisfy" API is then called implicitly, and the create/update response reflects the semantics of "satisfy" given below: e.g., it might return a non-200 status (201? 202?) to indicate the container request was created, but has not been satisfied yet: in this case the caller should wait a bit and then call "satisfy" explicitly.

State-dependent validations:

If state="Uncommitted":
  • has null priority.
  • can have its container_uuid reset to null by a client.
  • can have its container_uuid set to a non-null value by a system process.
If state="Committed":
  • has non-null priority.
  • can have its priority changed (but not to null).
  • can have its container_count_max changed.
  • can have its container_uuid changed by the system. (This allows the system to re-attempt a failed container.)
  • can have its name, description, and properties changed.
  • cannot be modified in other ways.
If state="Final":
  • can have its name, description, and properties changed.
  • cannot be modified in other ways.
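
The state-dependent validations above could be checked along the following lines. This Python sketch is an assumption-laden illustration: the function name, the by_system flag, and the error messages are hypothetical, and changes to the state attribute itself are not modeled.

```python
from typing import Any, Dict, List

DESCRIPTIVE = {"name", "description", "properties"}


def update_errors(state: str, changes: Dict[str, Any],
                  by_system: bool = False) -> List[str]:
    """List the reasons an update would be rejected in the given state."""
    errors: List[str] = []
    if state == "Uncommitted":
        # Clients may only reset container_uuid to null; the system may set it.
        if changes.get("container_uuid") is not None and not by_system:
            errors.append("container_uuid can only be set by the system")
    elif state == "Committed":
        allowed = DESCRIPTIVE | {"priority", "container_count_max", "container_uuid"}
        for attr in sorted(set(changes) - allowed):
            errors.append(f"{attr} cannot be modified once committed")
        if "priority" in changes and changes["priority"] is None:
            errors.append("priority cannot be null once committed")
        if "container_uuid" in changes and not by_system:
            errors.append("container_uuid can only be changed by the system")
    elif state == "Final":
        for attr in sorted(set(changes) - DESCRIPTIVE):
            errors.append(f"{attr} cannot be modified when final")
    else:
        errors.append(f"unknown state {state!r}")
    return errors
```

An empty list means the update passes these checks.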

arvados.v1.container_requests.cancel

Set priority to zero.

arvados.v1.container_requests.satisfy

If container_uuid is null, find or create a suitable container, and update container_uuid.

If container_uuid is not null, respond immediately.

Return a retryable error if the container request is not known to be unsatisfiable, but was not satisfied in time to respond to this API request. In other words, clients should be prepared to poll until the container request is satisfied.

The premise is that "create container request" should be able to return quickly, even if the system needs some time to decide how/whether to satisfy the new CR -- but it should also be easy to write a client that submits a ContainerRequest and then waits for a Container to be assigned.

This behavior can also be requested at creation time; see "create" above.
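
A client that submits a ContainerRequest and then waits for a Container to be assigned might poll "satisfy" roughly as in the Python sketch below. Everything here is hypothetical: RetryableError stands in for whatever retryable error response the API ends up returning, and satisfy is any callable that wraps the API call and returns the assigned container's UUID.

```python
import time
from typing import Callable, List


class RetryableError(Exception):
    """Hypothetical stand-in for the retryable error returned by "satisfy"."""


def wait_until_satisfied(satisfy: Callable[[], str], max_attempts: int = 10,
                         delay: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Call satisfy() until it returns a container UUID or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return satisfy()
        except RetryableError:
            if attempt < max_attempts:
                sleep(delay)  # wait a bit before polling again
    raise TimeoutError(f"not satisfied after {max_attempts} attempts")


# Fake server: not ready on the first two calls, then assigns a container.
responses: List[object] = [RetryableError(), RetryableError(),
                           "container-uuid-placeholder"]
naps: List[float] = []


def fake_satisfy() -> str:
    result = responses.pop(0)
    if isinstance(result, Exception):
        raise result
    return str(result)


assigned_uuid = wait_until_satisfied(fake_satisfy, sleep=naps.append)
```

Passing sleep as a parameter keeps the loop testable without real delays; a production client would probably also want a backoff between attempts.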

Q: Better name?

arvados.v1.containers.create and .update

These methods are not callable except by system processes.

arvados.v1.containers.progress

This method permits the container itself (using the token indicated by auth_uuid) to update the progress field.

arvados.v1.containers.auth

GET /arvados/v1/containers/{uuid}/auth

Given the uuid of a container, return the api_client_authorization record indicated by its auth_uuid. The token used to make this request must be the one indicated by the container's locked_by_uuid.

Debugging

Q: Need any infrastructure debug-logging controls in this API?

Q: Need any container debug-logging controls in this API? Or just use environment vars?

Scheduling and running containers

Q: When/how should we implement hooks for futures/promises: e.g., "run container Y when containers X0, X1, and X2 have finished"?

(PA) Having a field specifying "wait until time X to run this container" would be generally useful for cron-style tasks.

Accounting

A complete design for resource accounting and quota is out of scope here, but we do assert here that the API makes it feasible to retain accounting data.

It should be possible to retrieve, for a given container, a complete set of resource allocation intervals, each one including:
  • interval start time
  • interval end time (presented as null or now if the interval hasn't ended yet)
  • user uuid
  • container request id
  • container request priority
  • container state
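
One way to picture such a resource allocation interval is as a small record type. The Python sketch below is illustrative only: the class name, field names, and placeholder identifiers are assumptions, and an interval that has not ended is represented with end set to None.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class AllocationInterval:
    start: datetime
    end: Optional[datetime]  # None while the interval is still open
    user_uuid: str
    container_request_id: str
    container_request_priority: int
    container_state: str

    def duration(self, now: datetime) -> timedelta:
        """Length of the interval, treating an open interval as ending now."""
        return (self.end if self.end is not None else now) - self.start


t0 = datetime(2015, 7, 1, 0, 0, 0, tzinfo=timezone.utc)
open_interval = AllocationInterval(
    start=t0,
    end=None,
    user_uuid="user-uuid-placeholder",
    container_request_id="container-request-id-placeholder",
    container_request_priority=1,
    container_state="Running",
)
elapsed = open_interval.duration(now=t0 + timedelta(hours=1))
```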

TBD

How does a client get a list of previous (presumably failed) container attempts for a given request?
  • Add an array property, like previous_ or attempted_container_uuids?
Classifying failure/error modes
  • (PA) I think we should distinguish between infrastructure failure and task failure by distinguishing between "TempFail" and "PermFail" in the container state. "TempFail" shouldn't count against the container_count_max count, or alternatively we only honor container_count_max for "TempFail" tasks and don't retry "PermFail". Ideally, "TempFail" containers should retry forever, but with a backoff. One way to do the backoff is to schedule the container to run at a specific time in the future.
  • (TC) Classifying failure modes sounds useful, but I think it's wrong to overload the container state field with this information. State should represent the state of the container, not an assessment of how it got into that state. "Success/failure" has no bearing on what state the container can be in next, for example. If anything, I'd consider consolidating "Cancelled" and "Complete" (as "Stopped"?) rather than loading more information into the state field.
  • (TC) The "temporary/permanent" distinction seems orthogonal to the "infrastructure/user-code" distinction. E.g., if a container cannot run because the static physical hardware does not have enough memory, we shouldn't retry. E.g., if a container fails because the user code timed out trying to read from Keep, retrying would be worthwhile. It seems hard (impossible?) for us to determine automatically (reliably) whether an infrastructure problem is the root cause of a given container's non-zero exit code, and whether there's a reasonable chance retrying now will avoid hitting the same infrastructure problem.
  • (TC) The concept of "retry" seems to belong in ContainerRequest, not Container. A Container is just a container; if you "run something again", you've got a new container.

References

We should consider how this fits in with the Kubernetes notion of jobs:

https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/jobs.md