Containers API » History » Version 35

Tom Clegg, 05/26/2016 08:45 PM

{{>TOC}}

h1. Containers API (DRAFT)

See also [[Crunch2_dispatch]]

A Container resource is a record of a computational process.
* Its goal is to capture, unambiguously, as much information as possible about the environment in which the process was run. For example, git trees, data collections, and docker images are stored as content addresses. This makes it possible to reason about the difference between two processes, and to replay a process at a different time and place.
* Clients can read Container records, but only the system can create or modify them.

*Note about the term "containers" vs. "jobs" and "services":* Here, we focus on the use of containers as producers of output data. We anticipate extending the feature set to cover service containers as well. The distinguishing feature of a service container is that _having it running_ is inherently valuable because of the way it interacts with the outside world.

A ContainerRequest is a client's expression of interest in knowing the outcome of a computational process.
* Typically, in this context the client's description of the process is less precise than a Container: a ContainerRequest describes container _constraints_ which can have different interpretations over time. For example, a ContainerRequest with a @{"kind":"git_tree","commit_range":"abc123..master",...}@ mount might be satisfiable by any of several different source trees, and this set of satisfying source trees can change when the repository's "master" branch is updated.
* The system is responsible for finding suitable Containers and assigning them to ContainerRequests. (Currently this is expected to be done synchronously during the containerRequests.create and containerRequests.update transactions.)
* A ContainerRequest may indicate that it can _only_ be satisfied by a new Container record (i.e., existing results should not be reused). In this case creating a ContainerRequest amounts to a submission to the container queue. This is appropriate when the purpose of the ContainerRequest is to test whether a process is repeatable.
* A ContainerRequest may indicate that it _cannot_ be satisfied by a new Container record. This is an appropriate way to test whether a result is already available.

When the system has assigned a Container to a ContainerRequest, anyone with permission to read the ContainerRequest also has permission to read the Container.

h2. Use cases

h3. Preview

Tell me how you would satisfy container request X. Which pdh/commits would be used? Is the satisfying container already started? finished?

h3. Submit a previewed existing container

I'm happy with the already-running/finished container you showed me in "preview". Give me access to that container, its logs, and [when it finishes] its output.

h3. Submit a previewed new container

I'm happy with the new container the "preview" response proposed to run. Run that container.

h3. Submit a new container (disable reuse)

I don't want to use an already-running/finished container. Run a new container that satisfies my container request.

h3. Submit a new duplicate container (disable reuse)

I'm happy with the already-running/finished container you showed me in "preview". Run a new container exactly like that one.

h3. Select a container and associate it with my ContainerRequest

I'm not happy with the container you chose, but I know of another container that satisfies my request. Assuming I'm right about that, attach my ContainerRequest to the existing container of my choice.

h3. Just do the right thing without a preview

Satisfy container request X one way or another, and tell me the resulting container's UUID.

h2. ContainerRequest/Container life cycle

Illustrating container re-use and preview facility:
# Client ClientA creates a ContainerRequest CRA with priority=0.
# Server creates container CX and assigns CX to CRA, but does not try to run CX yet because max(priority)=0.
# Client ClientA presents CX to the user. "We haven't computed this result yet, so we'll have to run a new container. Is this OK?"
# Client ClientB creates a ContainerRequest CRB with priority=1.
# Server assigns CX to CRB and puts CX in the execution queue with priority=1.
# Client ClientA updates CRA with priority=2.
# Server updates CX with priority=2.
# Container CX starts.
# Client ClientA updates CRA with priority=0. (This is as close as we get to a "cancel" operation.)
# Server updates CX with priority=1. (CRB still wants this container to complete.)
# Container CX finishes.
# Clients ClientA and ClientB have permission to read CX (ever since CX was assigned to their respective ContainerRequests) as well as its progress indicators, output, and log.
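
The priority bookkeeping in the walk-through above can be sketched as follows. This is only an illustration, assuming the container's priority is the maximum over its assigned requests, as the "max(priority)=0" remark suggests. The @Container@ class and its @request_priorities@ and @should_run@ names are hypothetical and are not part of the API.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    uuid: str
    # Priority of every ContainerRequest currently assigned to this container,
    # keyed by a (hypothetical) request label such as "CRA".
    request_priorities: dict = field(default_factory=dict)

    @property
    def priority(self):
        # The walk-through uses max(priority) over the assigned requests.
        return max(self.request_priorities.values(), default=0)

    @property
    def should_run(self):
        # Zero means "do not run a container on behalf of this request".
        return self.priority > 0


cx = Container("CX")
history = []
# CRA previews (0), CRB submits (1), CRA raises to 2, CRA "cancels" (0).
for request, priority in [("CRA", 0), ("CRB", 1), ("CRA", 2), ("CRA", 0)]:
    cx.request_priorities[request] = priority
    history.append(cx.priority)

print(history)  # [0, 1, 2, 1]
```

After ClientA drops back to priority 0, the container keeps priority 1 because CRB is still interested, which matches the second-to-last server update in the list.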

h2. "ContainerRequest" schema

|Attribute|Type|Description|Discussion|Examples|
|uuid, owner_uuid, modified_by_client_uuid, modified_by_user_uuid|string|Usual Arvados model attributes|||
|
|created_at, modified_at|datetime|Usual Arvados model attributes|||
|
|name|string|Unparsed|||
|
|description|text|Unparsed|||
|
|properties|object|Client-defined structured data that does not affect how the container is run.|||
|
|state|string|Once a request is committed, the only attributes that can be modified are priority, container_uuid, and container_count_max. A request with @state="Final"@ cannot be modified.||@"Uncommitted"@
@"Committed"@
@"Final"@|
|
|requesting_container_uuid|string|When the referenced container ends, the container request is automatically cancelled.|Can be null. If changed to a non-null value, it must refer to a container that is running.||
|
|container_uuid|uuid|The container that satisfies this container request.|See "methods" below.||
|
|container_count_max|positive integer|Maximum number of containers to start ("attempts").|See "methods" below.||
|
|mounts|hash|Objects to attach to the container's filesystem and stdin/stdout.
Keys starting with a forward slash indicate objects mounted in the container's filesystem.
Other keys are given special meanings here.|
We use "stdin" instead of "/dev/stdin" because literally replacing /dev/stdin with a file would have a confusing effect on many unix programs. The stdin feature only affects the standard input of the first process started in the container; after that, the usual rules apply.|
<pre>{
 "/input/foo":{
  "kind":"collection",
  "portable_data_hash":"d41d8cd98f00b204e9800998ecf8427e+0"
 },
 "stdin":{
  "kind":"collection_file",
  "uuid":"zzzzz-4zz18-yyyyyyyyyyyyyyy",
  "path":"/foo.txt"
 },
 "stdout":{
  "kind":"file",
  "path":"/tmp/a.out"
 }
}</pre>|
|
|runtime_constraints|hash|Restrict the container's access to compute resources and the outside world (in addition to its explicitly stated inputs and output).
-- Each key is the name of a capability, like "internet" or "API" or "clock". The corresponding value is @true@ (the capability must be available in the container's runtime environment), @false@ (the capability must not be available), a single value, or an array of two numbers indicating an inclusive range. Numeric values are given in basic units (e.g., RAM is given in bytes, not KB or MB or MiB). If a key is omitted, availability of the corresponding capability is acceptable but not necessary.|This is a generalized version of "enforce purity restrictions": it is not a claim that the container will be pure. Rather, it helps us control and track runtime restrictions, which can be helpful when reasoning about whether a given container was pure.
-- In the most basic implementation, no capabilities are defined, and the only acceptable value of this attribute is the empty hash.
(TC) Should this structure be extensible like mounts?|
<pre>
{
  "ram":12000000000,
  "vcpus":[1,null],
  "API":true
}</pre>|
|
|container_image|string|Docker image repository and tag, docker image hash, collection UUID, or collection PDH.|||
|
|environment|hash|Environment variables and values that should be set in the container environment (@docker run --env@). This augments and (when conflicts exist) overrides environment variables given in the image's Dockerfile.|||
|
|cwd|string|Initial working directory, given as an absolute path (in the container) or a path relative to the WORKDIR given in the image's Dockerfile. The default is @"."@.||<pre>"/tmp"</pre>|
|
|command|array of strings|Command to execute in the container. Default is the CMD given in the image's Dockerfile.|
To use a UNIX pipeline, like "echo foo &#124; tr f b", or to interpolate environment variables, make sure your container image has a shell, and use a command like @["sh","-c","echo $PATH &#124; wc"]@.||
|
|output_path|string|Path to a directory or file inside the container that should be preserved as the container's output when it finishes.|This path _must_ be, or be inside, one of the mount targets.
For best performance, point output_path to a writable collection mount.||
|
|priority|number|Higher number means spend more resources (e.g., go ahead of other queued containers, bring up more nodes).
-- Zero means a container should not be run on behalf of this request. (Clients are expected to submit ContainerRequests with zero priority in order to preview the container that will be used to satisfy it.)
-- Priority is null if and only if @state!="Committed"@.||
null
@0@
@1000.5@
@-1@|
|
|expires_at|datetime|After this time, priority is considered to be zero. If the assigned container is running at that time, the container _may_ be cancelled to conserve resources.||
null
@2015-07-01T00:00:01Z@|
|
|filters|array|Additional constraints for satisfying the request, given in the same form as the @filters@ parameter accepted by the @containers.list@ API.||
@["created_at","<","2015-07-01T00:00:01Z"]@|
|
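
One possible reading of the @runtime_constraints@ value rules in the table above is sketched below. This is not the implemented behaviour: it assumes that a single value must match exactly, and that @null@ in a range (as in @"vcpus":[1,null]@) means "unbounded on that side". The function name @satisfies@ and the @available@ argument are hypothetical.

```python
def satisfies(requested, available):
    """Check a requested runtime_constraints hash against what a runtime offers."""
    for key, want in requested.items():
        have = available.get(key)
        if want is True:
            # Capability must be available.
            if not have:
                return False
        elif want is False:
            # Capability must not be available.
            if have:
                return False
        elif isinstance(want, (list, tuple)):
            # Inclusive range; None is assumed to mean "no bound on this side".
            low, high = want
            if have is None or isinstance(have, bool):
                return False
            if low is not None and have < low:
                return False
            if high is not None and have > high:
                return False
        elif have != want:
            # Single value: assumed to require an exact match.
            return False
    # Keys omitted from the request are acceptable either way.
    return True


requested = {"ram": 12000000000, "vcpus": [1, None], "API": True}
print(satisfies(requested, {"ram": 12000000000, "vcpus": 4, "API": True, "internet": False}))
print(satisfies(requested, {"ram": 12000000000, "vcpus": 4, "API": False}))
```

The first call accepts the offer (the extra @internet@ key is ignored because the request omits it); the second is refused because the request demands API access.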

h2. "Container" schema

|Attribute|Type|Description|Discussion|Examples|
|
|uuid, owner_uuid, created_at, modified_at, modified_by_client_uuid, modified_by_user_uuid|string|Usual Arvados model attributes|||
|
|state|string||See "Container states" below|
@"Queued"@
@"Locked"@
@"Running"@
@"Cancelled"@
-@"Failed"@-
@"Complete"@|
|
|locked_by_uuid|string|UUID of a token, indicating which dispatch process changed state to Locked|If null, any token can be used to lock. If not null, only the indicated token can modify.
Is null if and only if state&notin;{"Locked","Running"}||
|
|auth_uuid|string|UUID of a token to be passed into the container itself, used to access Keep-backed mounts, etc.|Is null if and only if state&notin;{"Locked","Running"}||
|
|started_at, finished_at, log||Same as Job attributes in Crunch1|||
|
|environment|hash|Must be equal to a ContainerRequest's environment in order to satisfy the ContainerRequest.|(TC) We could offer a "resolve" process here like we do with mounts: e.g., hash values in the ContainerRequest environment could be resolved according to the given "kind". I propose we leave room for this feature but don't add it yet.||
|
|cwd, command, output_path|string|Must be equal to the corresponding values in a ContainerRequest in order to satisfy that ContainerRequest.|||
|
|mounts|hash|Must contain the same keys as the ContainerRequest being satisfied. Each value must be within the range of values described in the ContainerRequest _at the time the Container is assigned to the ContainerRequest._|||
|
|runtime_constraints|hash|Compute resources, and access to the outside world, that are/were available to the container.
-- Generally this will contain additional keys that are not present in any corresponding ContainerRequests: for example, even if no ContainerRequests specified constraints on the number of CPU cores, the number of cores actually used will be recorded here.|
Permission/access types will change over time and it may be hard/impossible to translate old types to new. Such cases may cause old Containers to be ineligible for assignment to new ContainerRequests.
-- (TC) Is it permissible for this to gain keys over time? For example, a container scheduler might not be able to predict how many CPU cores will be available until the container starts.||
|
|output|string|Portable data hash of the output collection.|||
|
|exit_code|integer|Process exit code.|Is null if and only if @state!="Complete"@|
@null@
@0@
@1@
@129@|
|
|-pure-|-boolean-|-The container's output is thought to be dependent solely on its inputs, i.e., it is expected to produce identical output if repeated.-|
We want a feature along these lines, but "pure" seems to be a conclusion we can come to after examining various facts -- rather than a property of an individual container execution event -- and it probably needs something more subtle than a boolean.||
|
|container_image|string|Portable data hash of a collection containing the docker image used to run the container.|(TC) *If* docker image hashes can be verified efficiently, we can use the native docker image hash here instead of a collection PDH.||
|
|progress|number|A number between 0.0 and 1.0 describing the fraction of work done.|
If a container submits containers of its own, it should update its own progress as the child containers progress/finish.||
|
|priority|number|Priority assigned by the system, taking into account the priorities of all associated ContainerRequests.|||
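
The "must be equal" rules in the table above suggest a first-pass reuse check along the following lines. It is a partial sketch under stated assumptions: it compares @environment@, @cwd@, @command@ and @output_path@ for equality and checks that the @mounts@ keys agree, but it does not resolve mount values, @runtime_constraints@, @container_image@ or @filters@. The function name @may_satisfy@ is hypothetical.

```python
EXACT_MATCH_ATTRS = ("environment", "cwd", "command", "output_path")


def may_satisfy(container, request):
    """Return False if the container certainly cannot satisfy the request."""
    for attr in EXACT_MATCH_ATTRS:
        if container.get(attr) != request.get(attr):
            return False
    # Mounts must have the same keys; whether each value falls within the
    # range the request describes is not checked in this sketch.
    return set(container.get("mounts", {})) == set(request.get("mounts", {}))


request = {
    "environment": {"LANG": "C"},
    "cwd": "/tmp",
    "command": ["sh", "-c", "echo $PATH | wc"],
    "output_path": "/tmp",
    "mounts": {"/tmp": {"kind": "tmp", "capacity": 1000000000}},
}
candidate = dict(request)
other = dict(request, cwd="/")

print(may_satisfy(candidate, request))  # True
print(may_satisfy(other, request))      # False
```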

h2. Mount types

The "mounts" hash is the primary mechanism for adding data to the container at runtime (beyond what is already in the container image).

Each value of the "mounts" hash is itself a hash, whose "kind" key determines the handler used to attach data to the container.

|Mount type|@kind@|Expected keys|Description|Examples|Discussion|
|
|Arvados data collection|@collection@|
@"portable_data_hash"@ _or_ @"uuid"@ _may_ be provided. If not provided, a new collection will be created. This is useful when @"writable":true@ and the container's @output_path@ is (or is a subdirectory of) this mount target.
@"writable"@ may be provided with a @true@ or @false@ to indicate the path must (or must not) be writable. If not specified, the system can choose.
@"path"@ may be provided, and defaults to @"/"@.|
At container startup, the target path will have the same directory structure as the given path within the collection. Even if the files/directories are writable in the container, modifications will _not_ be saved back to the original collections when the container ends.|
<pre>
{
 "kind":"collection",
 "uuid":"...",
 "path":"/foo.txt"
}

{
 "kind":"collection",
 "uuid":"..."
}
</pre>||
|
|Git tree|@git_tree@|
One of {@"git-url"@, @"repository_name"@, @"uuid"@} must be provided.
One of {@"commit"@, @"revisions"@} must be provided.
"path" may be provided. The default path is "/".|
At container startup, the target path will have the source tree indicated by the given revision. The @.git@ metadata directory _will not_ be available: typically the system will use @git-archive@ rather than @git-checkout@ to prepare the target directory.
-- If a value is given for @"revisions"@, it will be resolved to a set of commits (as described in the "ranges" section of git-revisions(1)) and the container request will be satisfiable by any commit in that set.
-- If a value is given for @"commit"@, it will be resolved to a single commit, and the tree resulting from that commit will be used.
-- @"path"@ can be used to select a subdirectory or a single file from the tree indicated by the selected commit.
-- Multiple commits can resolve to the same tree: for example, the file/directory given in @"path"@ might not have changed between commits A and B.
-- The resolved mount (found in the Container record) will have only the "kind" key and a "blob" or "tree" key indicating the 40-character hash of the git tree/blob used.|
<pre>
{
 "kind":"git_tree",
 "uuid":"zzzzz-s0uqq-xxxxxxxxxxxxxxx",
 "commit":"master"
}

{
 "kind":"git_tree",
 "uuid":"zzzzz-s0uqq-xxxxxxxxxxxxxxx",
 "commit_range":"bugfix^..master",
 "path":"/crunch_scripts/grep"
}
</pre>||
|
|Temporary directory|@tmp@|
@"capacity"@: capacity (in bytes) of the storage device.
@"device_type"@ (optional, default "network"): one of @{"ram", "ssd", "disk", "network"}@ indicating the acceptable level of performance.|
At container startup, the target path will be empty. When the container finishes, the content will be discarded. This will be backed by a storage mechanism no slower than the specified type.|
<pre>
{
 "kind":"tmp",
 "capacity":100000000000
}

{
 "kind":"tmp",
 "capacity":1000000000,
 "device_type":"ram"
}
</pre>||
|
|Keep|@keep@|
Expose all readable collections via arv-mount.|Requires suitable runtime constraints.|
<pre>
{
 "kind":"keep"
}
</pre>||
|
|Mounted file or directory|@file@|
@"path"@: absolute path (inside the container) of a file or directory that is (or is inside) another mount target.|Can be used for "stdin" and "stdout" targets.|
<pre>
{
 "kind":"file",
 "path":"/mounted_tmp/a.out"
}
</pre>||
|
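
As a rough illustration of how a client might sanity-check a "mounts" hash against the table above before submitting it, consider the sketch below. The helper names @split_targets@ and @mount_problems@ are hypothetical. The checks assume that "one of" means at least one of the listed keys is present, and that @capacity@ is required for @tmp@ mounts; the draft does not state either point explicitly.

```python
KNOWN_KINDS = {"collection", "git_tree", "tmp", "keep", "file"}


def split_targets(mounts):
    """Separate filesystem targets (keys starting with "/") from special keys."""
    paths = {k: v for k, v in mounts.items() if k.startswith("/")}
    special = {k: v for k, v in mounts.items() if not k.startswith("/")}
    return paths, special


def mount_problems(mounts):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for target, mount in mounts.items():
        kind = mount.get("kind")
        if kind not in KNOWN_KINDS:
            problems.append(f"{target}: unknown kind {kind!r}")
        elif kind == "git_tree":
            if not any(k in mount for k in ("git-url", "repository_name", "uuid")):
                problems.append(f"{target}: git_tree needs a repository")
            if not any(k in mount for k in ("commit", "revisions")):
                problems.append(f"{target}: git_tree needs commit or revisions")
        elif kind == "tmp" and "capacity" not in mount:
            problems.append(f"{target}: tmp needs capacity")
        elif kind == "file" and not str(mount.get("path", "")).startswith("/"):
            problems.append(f"{target}: file needs an absolute path")
    return problems


mounts = {
    "/src": {"kind": "git_tree", "uuid": "zzzzz-s0uqq-xxxxxxxxxxxxxxx", "commit": "master"},
    "/tmp": {"kind": "tmp", "capacity": 100000000000},
    "stdout": {"kind": "file", "path": "/tmp/a.out"},
}
paths, special = split_targets(mounts)
print(sorted(paths), sorted(special))  # ['/src', '/tmp'] ['stdout']
print(mount_problems(mounts))          # []
print(mount_problems({"/scratch": {"kind": "tmp"}}))
```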

h2. Container states

|*state*|*significance*|
|Queued|Waiting for a dispatcher to lock it and try to run the container.|
|Locked|A dispatcher has "taken" the container and is allocating resources for it. The container has not started yet.|
|Running|Resources have been allocated and the contained process has been started (or is about to start). Crunch-run _must_ set state to Running _before_ there is any possibility that user code will run in the container.|
|Complete|Container was running, and the contained process/command has exited.|
|Cancelled|The container did not run long enough to produce an exit code. This includes cases where the container didn't even start, cases where the container was interrupted/killed before it exited by itself (e.g., priority changed to 0), and cases where some problem prevented the system from capturing the contained process's exit status (exit code and output).|
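
The state descriptions above, together with the "null if and only if" notes in the Container schema, can be expressed as a small consistency check. The transition table is an assumption inferred from the descriptions (for example, it does not model a dispatcher releasing a lock back to Queued), and the names @ALLOWED_TRANSITIONS@, @can_transition@ and @invariant_violations@ are hypothetical.

```python
# Assumed transitions, inferred from the state descriptions; not normative.
ALLOWED_TRANSITIONS = {
    "Queued": {"Locked", "Cancelled"},
    "Locked": {"Running", "Cancelled"},
    "Running": {"Complete", "Cancelled"},
    "Complete": set(),
    "Cancelled": set(),
}


def can_transition(old, new):
    return new in ALLOWED_TRANSITIONS.get(old, set())


def invariant_violations(container):
    """List the schema's 'null if and only if' rules that a record breaks."""
    state = container["state"]
    active = state in ("Locked", "Running")
    violations = []
    for attr in ("locked_by_uuid", "auth_uuid"):
        # Must be null exactly when the state is neither Locked nor Running.
        if (container.get(attr) is None) == active:
            violations.append(attr)
    # exit_code must be null exactly when the state is not Complete.
    if (container.get("exit_code") is None) == (state == "Complete"):
        violations.append("exit_code")
    return violations


running = {"state": "Running", "locked_by_uuid": "token-a", "auth_uuid": "token-b", "exit_code": None}
broken = {"state": "Complete", "locked_by_uuid": None, "auth_uuid": None, "exit_code": None}

print(can_transition("Queued", "Locked"))   # True
print(can_transition("Complete", "Running"))  # False
print(invariant_violations(running))        # []
print(invariant_violations(broken))         # ['exit_code']
```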

h2. Permissions

Users own ContainerRequests but the system owns Containers. Users get permission to read Containers by virtue of linked ContainerRequests.

h2. API methods

Changes from the usual REST APIs:

h3. arvados.v1.container_requests.create and .update

These methods can fail when objects referenced in the "mounts" hash do not exist, or the acting user has insufficient permission on them.

If @state="Uncommitted"@, the request:
* has null @priority@.
* can have its @container_uuid@ reset to null by a client.
* can have its @container_uuid@ set to a non-null value by a system process.

If @state="Committed"@, the request:
* has non-null @priority@.
* can have its @priority@ changed (but not to null).
* can have its @container_count_max@ changed.
* can have its @container_uuid@ changed by the system. (This allows the system to re-attempt a failed container.)
* cannot be modified in other ways.

If @state="Final"@, the request:
* cannot be modified.
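
The per-state update rules above might be enforced with a check like the following. This is a sketch rather than the server's validation code: it assumes that an uncommitted request may otherwise be edited freely, it does not model changes to @state@ itself, and the function name @rejected_changes@ and its @by_system@ flag are hypothetical.

```python
def rejected_changes(state, changes, by_system=False):
    """Return {attribute: reason} for each requested change that is refused."""
    refused = {}
    for attr, value in changes.items():
        if state == "Final":
            refused[attr] = "a Final request cannot be modified"
        elif attr == "container_uuid":
            # Clients may only reset it to null, and only while Uncommitted.
            client_reset = state == "Uncommitted" and value is None
            if not (by_system or client_reset):
                refused[attr] = "only the system may assign a container"
        elif attr == "priority":
            if state == "Uncommitted" and value is not None:
                refused[attr] = "priority is null until the request is Committed"
            elif state == "Committed" and value is None:
                refused[attr] = "priority cannot become null once Committed"
        elif state == "Committed" and attr != "container_count_max":
            refused[attr] = "attribute is frozen once Committed"
    return refused


print(rejected_changes("Committed", {"priority": 5, "container_count_max": 3}))  # {}
print(sorted(rejected_changes("Committed", {"command": ["true"], "priority": None})))
print(rejected_changes("Uncommitted", {"container_uuid": None}))  # {}
```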

h3. arvados.v1.container_requests.cancel

Set @priority@ to zero.

h3. arvados.v1.container_requests.satisfy

Find or create a suitable container, and update @container_uuid@.

Return an error if @container_uuid@ is not null.

Q: Can this be requested during create? Create+satisfy is a common operation so having a way to do it in a single API call might be a worthwhile convenience.

Q: Better name?

h3. arvados.v1.containers.create and .update

These methods are not callable except by system processes.

h3. arvados.v1.containers.progress

This method permits specific types of updates while a container is running: update progress, record success/failure.

Q: [How] can a client submitting container B indicate it shouldn't run unless/until container A succeeds?

h3. arvados.v1.containers.get_auth

@GET /arvados/v1/containers/{uuid}/get_auth@

Given the uuid of a container, return the api_client_authorization record indicated by its auth_uuid. The token used to make this request must be the one indicated by the container's locked_by_uuid.

h2. Debugging

Q: Need any infrastructure debug-logging controls in this API?

Q: Need any container debug-logging controls in this API? Or just use environment vars?

h2. Scheduling and running containers

Q: When/how should we implement hooks for futures/promises: e.g., "run container Y when containers X0, X1, and X2 have finished"?

(PA) Having a field specifying "wait until time X to run this container" would be generally useful for cron-style tasks.

h2. Accounting

A complete design for resource accounting and quota is out of scope here, but we do assert here that the API makes it feasible to retain accounting data.

It should be possible to retrieve, for a given container, a complete set of resource allocation intervals, each one including:
* interval start time
* interval end time (presented as null or now if the interval hasn't ended yet)
* user uuid
* container request id
* container request priority
* container state
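
One way to picture such an interval record is sketched below. The @AllocationInterval@ class, its field names, and the @seconds_by_user@ helper are hypothetical. The draft only lists which pieces of information each interval should carry, and the identifiers in the example are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AllocationInterval:
    start: datetime
    end: Optional[datetime]  # None while the interval is still open
    user_uuid: str
    container_request_uuid: str
    container_request_priority: float
    container_state: str

    def seconds(self, now):
        # An open interval is measured up to "now".
        return ((self.end or now) - self.start).total_seconds()


def seconds_by_user(intervals, now):
    """Total allocated time per user, in seconds."""
    totals = {}
    for interval in intervals:
        totals[interval.user_uuid] = totals.get(interval.user_uuid, 0.0) + interval.seconds(now)
    return totals


t0 = datetime(2016, 5, 26, 0, 0, tzinfo=timezone.utc)
t1 = datetime(2016, 5, 26, 0, 10, tzinfo=timezone.utc)
now = datetime(2016, 5, 26, 0, 15, tzinfo=timezone.utc)
intervals = [
    AllocationInterval(t0, t1, "user-a", "request-a", 2, "Running"),
    AllocationInterval(t1, None, "user-b", "request-b", 1, "Running"),
]
print(seconds_by_user(intervals, now))  # {'user-a': 600.0, 'user-b': 300.0}
```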

h2. TBD

How does a client get a list of previous (presumably failed) container attempts for a given request?
* Add an array property, like previous_ or attempted_container_uuids?

(PA) I think we should distinguish between infrastructure failure and task failure by distinguishing between "TempFail" and "PermFail" in the container state. "TempFail" shouldn't count against the container_count_max count, or alternately we only honor container_count_max for "TempFail" tasks and don't retry "PermFail". Ideally, "TempFail" containers should retry forever, but with a backoff. One way to do the backoff is to schedule the container to run at a specific time in the future.

h2. References

Should consider how this fits in with the Kubernetes notion of jobs:

https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/jobs.md