Containers API » History » Revision 5

Tom Clegg, 05/20/2015 10:02 PM

Jobs API (DRAFT)

"JobRequest" schema

Attribute	Type	Description	Discussion	Examples
uuid, owner_uuid, modified_by_client_uuid, modified_by_user_uuid, created_at, modified_at		Standard fields
name, description		User-friendly information about the job	(TC)Does "user friendly" just mean "user controlled", or is Arvados expected to do something here?
job_uuid	uuid	The job that satisfies this job request, or null if a job has not yet been found or queued. Assigned by the system: cannot be modified directly by clients.
input	hash	Hash of arbitrary keys and values.	(TC)Should this be allowed to include collection UUIDs -- like git_revision can be given as a branch name -- which will be resolved to PDHes automatically before the job starts?	{ "foo":"d41d8cd98f00b204e9800998ecf8427e+0", "bar":123 }
pure	boolean	Process is thought to be pure (see below).	(TC)What do we do when given two JobRequests that are identical except that "pure" is different?
git_repository, git_revision	string	Set of git commits suitable for running the job. git_revision can be either a commit or a range -- see `gitrevisions(7)`.	(TC)Perhaps we should take the opportunity to support these semantics on multiple git repositories per job (#3820).
docker_image	string	Docker image repository and tag, docker image hash, collection UUID, or collection PDH.
git_checkout_dir, temp_dir, output_dir, keep_dir	string	Desired paths inside the docker container where the git checkout, temporary directory, output directory, and keep mount should go.	(TC)What are the defaults? This flexibility seems useful for a job that submits other jobs (like a workflow/pipeline runner) but would be cumbersome to specify every time ("remind me, where does workflow runner X expect its keep mount to be?"). (TC)What is the significance of output_dir? [How] does Crunch merge the content of the `output_dir` and the value of the `output` attribute to arrive at the final output?
stdin	string	A file in Keep that should be sent to standard input.	(TC)Is this required to be a regular file or can it be a pipe? (TC)If the job does not finish reading it, is that an error, like `set -o pipefail` in bash? (TC)Relationship between stdin and inputs is unclear. Is stdin an additional input, or is it an error to specify a stdin that isn't in a collection mentioned in inputs?	`{pdh}/foo.txt`
stdout	string	A filename in the output directory to which standard output should be directed.	(TC)If this is not given, is stdout sent to stderr/logs as it is now? (TC)Relationship between stdout and output is unclear. If I specify a "stdout" but the job process sets its output by itself, is Crunch expected to clobber that output with the collection resulting from the "stdout" mechanism?
environment	hash	Environment variables and values that should be set in the container environment (docker run --env)	(TC)If this contains variables already used by Crunch (TASK_KEEPMOUNT), which takes precedence?
initial_collection	uuid	A collection describing the starting contents of the output directory.	(TC)Not a fan of this attribute name. (TC)Is it an error if this collection is not one of the inputs? Or do all provenance queries need to treat this separately? (TC)Perhaps better if each `input` item were available at `{job_workdir}/input/{inputkey}` and the "preload" behavior could be achieved by setting `output_dir` to `input/foo`?
cwd	string	Initial working directory, given as an absolute path (in the container) or relative to {job_workdir}. Default "output".		`/tmp`, `output`, `input/foo`
command	array of strings	Parameters to the actual executable command line.	(TC)Possible to specify a pipe, like "echo foo | tr f b"? Any shell variables supported? Or do you just use `["sh","-c","echo $PATH | wc"]` if you want a shell?
runtime_debugging	boolean	Enable debug logging for the infrastructure (such as arv-mount) (this might get logged privately away from the end user)	(TC)This doesn't sound like it should be a job attribute. Infrastructure debugging shouldn't require touching users' job records. An analogous user feature would be useful, but perhaps it just boils down to adding DEBUG=1 to `environment`?
priority	number	Higher number means spend more resources (e.g., go ahead of other queued jobs, bring up more nodes)	(TC)Do we need something more subtle than a single number? (TC)What if a high priority job is waiting for a low priority job to finish?	`0`, `1000.5`, `-1`
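
To make the schema above more concrete, here is a minimal sketch of what a JobRequest body might look like when built by a client. All values are placeholders, `SYSTEM_ASSIGNED` and `client_payload` are hypothetical names introduced for illustration, and nothing here reflects a settled wire format.

```python
import json

# Hypothetical JobRequest body; every value is a placeholder, not a real object.
job_request = {
    "name": "example job",
    "description": "illustrative request only",
    "input": {"foo": "d41d8cd98f00b204e9800998ecf8427e+0", "bar": 123},
    "pure": True,
    "git_repository": "example-repo",        # assumed repository name
    "git_revision": "master",                # a commit or a range
    "docker_image": "example/image:latest",  # assumed image repository and tag
    "cwd": "output",
    "command": ["sh", "-c", "echo $PATH | wc"],
    "environment": {"DEBUG": "1"},
    "priority": 1,
}

# Attributes the draft describes as assigned by the system, not by clients.
SYSTEM_ASSIGNED = {"job_uuid"}


def client_payload(request):
    """Serialize a request to JSON, dropping system-assigned attributes."""
    body = {k: v for k, v in request.items() if k not in SYSTEM_ASSIGNED}
    return json.dumps(body, sort_keys=True)
```

The `command` value uses the explicit `sh -c` form raised in the discussion column, since whether pipes or shell variables are interpreted otherwise is still an open question.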

"Job" schema

Attribute	Type	Description	Discussion	Examples
state, started_at, finished_at, log		Same as current job
input, stdin, stdout, environment, initial_collection, cwd, command, runtime_debugging, git_checkout_dir, temp_dir, output_dir, keep_dir		Copied from the relevant JobRequest(s) and made available to the job process.
output	hash	Arbitrary hash provided by the job process.	(PA)Changing the basic output type from a collection to a JSON object is important for native CWL support. (TC)Need examples of how "output is one collection", "output is multiple collections", "output is collections plus other stuff(?)", and "output is other stuff without collections" are to be encoded.
pure	boolean	The job's output is thought to be dependent solely on its inputs (i.e., it is expected to produce identical output if repeated)	(TC)Is this merely an assertion by the submitter? Is the job itself expected to set or reset it? Does the system behave differently while running the job (e.g., different firewall rules, some APIs disabled)? [Under what conditions] is the system allowed to change it from true to false? Is null allowed, presumably signifying "not known"?	`null` (?) `true` `false`
git_commit_sha1	string	Full 40-character commit hash used to run the job.	(TC)Should we store the tree hash as well? Or instead of the commit hash, if we prevent the job from seeing the git metadata, which would be good for reproducibility (consider a job that starts by doing "git checkout master" in its working directory). (TC)Do we need to store git_repository here too? Presumably, the relevant git tree should be in the internal git repository as a prerequisite of Job creation. And if two repositories have the same commit/tree, it shouldn't matter which we pull it from when running the job.
docker_image_pdh	string	Portable data hash of a collection containing the docker image used to run the job.	(TC) If docker image hashes can be verified efficiently, we can use the native docker image hash here instead of a collection PDH.
progress	number	A number between 0.0 and 1.0 describing the fraction of work done.	(TC)How does this relate to child tasks? E.g., is a job supposed to update this itself as its child tasks complete?
priority	number	Highest priority of all associated JobRequests
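
The relationship between the two schemas can be sketched as follows. This is only an illustration under stated assumptions: it assumes all JobRequests sharing a Job agree on the copied attributes (so the first request is used as the source), and `COPIED_ATTRIBUTES` and `job_attributes` are hypothetical names, not part of any existing API.

```python
# Attributes the draft lists as copied from the JobRequest(s) to the Job.
COPIED_ATTRIBUTES = [
    "input", "stdin", "stdout", "environment", "initial_collection",
    "cwd", "command", "runtime_debugging",
    "git_checkout_dir", "temp_dir", "output_dir", "keep_dir",
]


def job_attributes(job_requests):
    """Derive Job attributes from one or more associated JobRequests.

    Assumption: requests sharing a Job agree on the copied attributes,
    so they are taken from the first request.
    """
    first = job_requests[0]
    attrs = {key: first.get(key) for key in COPIED_ATTRIBUTES}
    # Per the schema: highest priority of all associated JobRequests.
    attrs["priority"] = max(r["priority"] for r in job_requests)
    return attrs
```

User-facing fields such as `name` and `description` are deliberately not copied, since the table lists them only on the JobRequest.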

Permissions

Users own JobRequests but the system owns Jobs. Users get permission to read Jobs by virtue of linked JobRequests.
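
One way to read the rule above is as a check over linked JobRequests. The sketch below assumes a caller-supplied `readable_by` predicate standing in for whatever permission lookup the system uses; both `can_read_job` and `readable_by` are hypothetical names.

```python
def can_read_job(user_uuid, job_uuid, job_requests, readable_by):
    """Return True if the user can read any JobRequest linked to the Job.

    readable_by is a hypothetical callable (user_uuid, job_request) -> bool
    representing the existing permission check on JobRequests.
    """
    return any(
        r.get("job_uuid") == job_uuid and readable_by(user_uuid, r)
        for r in job_requests
    )


def owner_only(user_uuid, job_request):
    """Simplest possible stand-in: only the owner can read a JobRequest."""
    return job_request.get("owner_uuid") == user_uuid
```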

"jobs" API methods

TODO: bring this section up to speed with distinct JobRequest and Job records.

Reuse and reproducibility require some changes to the usual REST APIs.

arvados.v1.jobs.create

Q: How does "find or create" work?

Q: How does a client submitting job B indicate it shouldn't run unless/until job A succeeds?

arvados.v1.jobs.update

Most attributes cannot be changed after a job starts. Some attributes can be changed:

name, description, priority
output, progress, state, finished_at, log (ideally only by the job itself - should this be enforced?)
modified_*
Q: (any more?)
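
The update rule above could be enforced with a check along these lines. This is a sketch, not a description of existing server behavior: `MUTABLE_AFTER_START` and `rejected_attributes` are hypothetical names, and the set simply mirrors the list above, which may still grow.

```python
# Attributes the draft lists as changeable after a job starts.
MUTABLE_AFTER_START = {
    "name", "description", "priority",
    "output", "progress", "state", "finished_at", "log",
}


def rejected_attributes(changes, started):
    """Return the attribute names in `changes` that may not be updated.

    Before the job starts nothing is rejected; afterwards only the listed
    attributes and the modified_* fields are accepted.
    """
    if not started:
        return set()
    return {
        key for key in changes
        if key not in MUTABLE_AFTER_START and not key.startswith("modified_")
    }
```

The sketch does not address who is making the change, so the open question of restricting `output`, `progress`, `state`, `finished_at`, and `log` to the job itself would need a separate check.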

arvados.v1.jobs.get

Q: Should this omit mutable attributes when retrieved by a pure job? (Ideally, pure jobs should not be able to retrieve data other than their stated immutable / content-addressed inputs, either through Keep or through the API.)

Scheduling and running jobs

Q: If two users submit identical pure jobs and ask to reuse existing jobs, whose token does the job get to use?

Should pure jobs be run as a pseudo-user that is given read access to the relevant objects for the duration of the job? (This would make it safer to share jobs -- see #5823)

Q: If two users submit identical pure jobs with different priority, which priority is used?

Choices include "whichever is greater" and "sum".
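
The two choices differ whenever more than one request is attached to the same job. A small sketch, with `combined_priority` as a hypothetical helper name:

```python
def combined_priority(priorities, policy="greater"):
    """Combine the priorities of identical requests under a given policy."""
    if policy == "greater":
        return max(priorities)
    if policy == "sum":
        return sum(priorities)
    raise ValueError("unknown policy: %r" % (policy,))
```

For two requests with priorities 10 and 5, "greater" yields 10 and "sum" yields 15. Under "sum", many low-priority duplicates can outrank a single high-priority request, and a negative priority lowers the total.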

Q: If two users submit identical pure jobs and one cancels -- or one user submits two identical jobs and cancels one -- does the work stop, or continue? What do the job records look like after this?


Updated by Tom Clegg over 9 years ago · 64 revisions
