Feature #13773
Updated by Tom Clegg over 6 years ago
It is useful to know that a container is going to fail before it has actually finished.
Workflow developers / users want to know this so workflows can be resubmitted.
arvados-cwl-runner wants to use this to avoid reusing an arvados-cwl-runner container which has already decided to fail.
h2. Proposed implementation
h3. API
Add a @runtime_status@ serialized hash attribute to the @containers@ model on the API server, stored as an indexed jsonb column.
* @runtime_status@ can be updated when state∈{"Locked", "Running"}.
* @runtime_status@ is cleared if state changes from "Locked" to "Queued" (to avoid leaking status messages between different dispatch attempts).
If a container with state=="Running" has an @error@ key in its @runtime_status@ then it must not be a candidate for reuse.
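
The three rules above can be sketched as plain functions. This is only an illustration of the intended behaviour, not API server code: the function names are hypothetical, containers are modelled as dicts, and the reuse check covers only the @error@ rule (any other reuse criteria are out of scope here).

```python
# Hypothetical sketch of the proposed runtime_status rules.
# Function names are illustrative, not part of any existing API.

UPDATABLE_STATES = {"Locked", "Running"}


def can_update_runtime_status(state):
    """runtime_status may only be written while Locked or Running."""
    return state in UPDATABLE_STATES


def runtime_status_after_transition(old_state, new_state, runtime_status):
    """Clear runtime_status when a container goes from Locked back to Queued,
    so messages do not leak between dispatch attempts."""
    if old_state == "Locked" and new_state == "Queued":
        return {}
    return runtime_status


def excluded_from_reuse(container):
    """A Running container with an error key must not be a reuse candidate."""
    status = container.get("runtime_status") or {}
    return container.get("state") == "Running" and "error" in status
```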
h3. Documentation
Well-known keys in @runtime_status@ should be documented on the container schema page:
* @error@: string, indicates the container will definitely fail, or has already failed
* @warning@: string, indicates something unusual happened or is currently happening, but isn't considered fatal
* @activity@: string, a message for the end user about what state the container is currently in
h3. arvados-cwl-runner
* store the first fatal error (failed child, error in workflow definition) in @error@
* mention any additional errors ("first error (4 additional errors)"?)
* store jobsComplete / jobsWaiting / jobsFailed
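
A minimal sketch of how the runner might assemble this status. The class and method names are hypothetical, and it assumes the job counters are stored as top-level keys of @runtime_status@, which the proposal does not specify. The "(N additional errors)" suffix follows the wording floated above.

```python
# Hypothetical helper; not part of arvados-cwl-runner.
class RuntimeStatusTracker:
    def __init__(self):
        self.first_error = None       # first fatal error message seen
        self.additional_errors = 0    # errors recorded after the first
        self.counts = {"jobsComplete": 0, "jobsWaiting": 0, "jobsFailed": 0}

    def record_error(self, message):
        """Keep only the first fatal error; count the rest."""
        if self.first_error is None:
            self.first_error = message
        else:
            self.additional_errors += 1

    def runtime_status(self):
        """Build the hash that would be written to runtime_status."""
        status = dict(self.counts)  # assumption: counters live at top level
        if self.first_error is not None:
            message = self.first_error
            if self.additional_errors:
                plural = "" if self.additional_errors == 1 else "s"
                message += " (%d additional error%s)" % (
                    self.additional_errors, plural)
            status["error"] = message
        return status
```

For example, after one call to @record_error("step foo failed")@ followed by four more errors, the @error@ value would read @step foo failed (4 additional errors)@.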
h3. Workbench
If a running container has @error@ or @warning@ in its @runtime_status@, Workbench should flag it with a color/label to distinguish it from the normal "running" state (perhaps also showing the error/warning message in a tooltip) on the dashboard and other summary views.
Workbench should display any @error@ or @warning@ messages prominently in the detailed view.
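
The summary-view behaviour could look roughly like the following. This is a sketch in Python for consistency with the other examples, not Workbench code; the function name and the label strings ("failing", "warning", "running") are hypothetical placeholders, and it assumes @error@ takes precedence over @warning@ when both are present.

```python
# Hypothetical sketch of choosing a summary label and tooltip.
def summary_label(container):
    """Return (label, tooltip) for a container in a summary view."""
    state = container.get("state")
    status = container.get("runtime_status") or {}
    if state != "Running":
        # Only running containers get the special flag.
        return (state, None)
    if "error" in status:
        return ("failing", status["error"])    # assumed to outrank warning
    if "warning" in status:
        return ("warning", status["warning"])
    return ("running", None)
```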
h3. Additional ideas
These features are anticipated but they are _not_ expected to be included in the initial implementation:
* crunch-dispatch-slurm can update the @activity@ field to indicate "in slurm queue"
* crunch-run can update the @activity@ field to indicate @loading Docker image@ or @uploading output@
* crunch-run or arv-mount can detect likely cache thrashing conditions and generate a warning
* arvados-cwl-runner reports additional structured error details under @errorDetails@ for Workbench to display