Feature #13773
Status: closed
"Will fail" status for failing (but not yet failed) containers
Description
It is useful to know when a container is going to fail, but hasn't completed yet.
Workflow developers / users want to know this so workflows can be resubmitted.
arvados-cwl-runner wants to use this to avoid reusing an arvados-cwl-runner container which has already decided to fail.
Proposed implementation
API
Add a runtime_status serialized hash attribute to the containers model on the API server, stored as an indexed jsonb column.
- runtime_status can be updated when state∈{"Locked", "Running"}.
- runtime_status is cleared if state changes from "Locked" to "Queued" (to avoid leaking status messages between different dispatch attempts).
If a container with state=="Running" has an error key in its runtime_status, then it must not be a candidate for reuse. (Note that the "Locked" state is deliberately omitted here because dispatch/setup errors are retryable.)
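The rules above can be sketched in Python as follows. This is only an illustration of the proposed behavior, not the API server's code: the function names and the dict-based container representation are hypothetical, and the real reuse check would apply its other criteria in addition to this one.

```python
# Hypothetical model of the proposed rules; containers are plain dicts here.
UPDATABLE_STATES = {"Locked", "Running"}


def can_update_runtime_status(state):
    """runtime_status may only be written while the container is Locked or Running."""
    return state in UPDATABLE_STATES


def on_state_change(container, new_state):
    """Return a copy of the container in new_state.

    runtime_status is cleared on Locked -> Queued so that messages from one
    dispatch attempt do not leak into the next.
    """
    updated = dict(container)
    if container["state"] == "Locked" and new_state == "Queued":
        updated["runtime_status"] = {}
    updated["state"] = new_state
    return updated


def excluded_from_reuse(container):
    """True if a Running container has reported an error and must not be reused.

    Locked containers are deliberately not excluded: dispatch/setup errors
    are retryable.
    """
    status = container.get("runtime_status") or {}
    return container["state"] == "Running" and "error" in status
```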
Documentation
Well-known keys in runtime_status should be documented on the container schema page:
- error: string, indicates the container will definitely fail, or has already failed
- warning: string, indicates something unusual happened or is currently happening, but isn't considered fatal
- activity: string, a message for the end user about what state the container is currently in
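For illustration, a runtime_status hash using the three well-known keys might look like the example below. The message texts are invented, and the helper that checks that each well-known key holds a string is a hypothetical convenience, not part of the proposal.

```python
import json

WELL_KNOWN_KEYS = ("error", "warning", "activity")


def non_string_well_known_keys(runtime_status):
    """Return the well-known keys that are present but do not hold a string."""
    return [
        key
        for key in WELL_KNOWN_KEYS
        if key in runtime_status and not isinstance(runtime_status[key], str)
    ]


# Example hash; all message texts are made up for this sketch.
example = {
    "error": "child container failed",
    "warning": "something unusual happened",
    "activity": "waiting for remaining steps to finish",
}

# The hash serializes directly to the JSON that a jsonb column would hold.
print(json.dumps(example, indent=2))
```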
arvados-cwl-runner
- store the first fatal error (failed child, error in workflow definition) in error
- (secondary goal) mention any additional errors ("first error (4 additional errors)"?)
- (secondary goal) store jobsComplete / jobsWaiting / jobsFailed
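One way the error summary could be built is sketched below. The function name is hypothetical, and the "(N additional errors)" wording follows the tentative format suggested in the list above rather than a settled design.

```python
def summarize_errors(errors):
    """Build the value for runtime_status["error"] from errors in order of occurrence.

    Returns None when there are no errors, the first error alone when there is
    exactly one, and otherwise the first error plus a count of the rest.
    """
    if not errors:
        return None
    first = errors[0]
    extra = len(errors) - 1
    if extra == 0:
        return first
    noun = "error" if extra == 1 else "errors"
    return "{} ({} additional {})".format(first, extra, noun)
```

Keeping the first error as the headline matches the primary goal; the count is only appended when later errors exist, so the single-error case stays unchanged.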
Workbench
If a running container has error or warning in its runtime_status, Workbench should flag it with a color/label to distinguish it from the normal "running" state (perhaps also showing the error/warning message in a tooltip) on the dashboard and other summary views.
Workbench should display any error or warning messages prominently in the detailed view.
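The flagging decision for summary views could look roughly like the sketch below. It is written in Python only to illustrate the logic; the label names "Failing" and "Warning" and the function itself are assumptions, not Workbench code.

```python
def summary_label(state, runtime_status):
    """Return (label, tooltip) for a container in a summary view.

    A running container with an error or warning gets a distinct label, with
    the message as the tooltip. An error takes precedence over a warning.
    """
    status = runtime_status or {}
    if state == "Running":
        if "error" in status:
            return ("Failing", status["error"])
        if "warning" in status:
            return ("Warning", status["warning"])
    return (state, None)
```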
Additional ideas
These features are anticipated but they are not expected to be included in the initial implementation:
- crunch-dispatch-slurm can update the activity field to indicate "in slurm queue"
- crunch-run can update the activity field to indicate "loading Docker image" or "uploading output"
- crunch-run or arv-mount can detect likely cache thrashing conditions and generate a warning
- arvados-cwl-runner reports additional structured error details under errorDetails for Workbench to display
Updated by Peter Amstutz almost 7 years ago
- Subject changed from "Will fail" status to prevent reuse of failing (but not yet failed) container to "Will fail" status for failing (but not yet failed) containers
- Description updated (diff)
- Status changed from In Progress to New
Updated by Tom Morris almost 7 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
- Story points set to 3.0
Updated by Tom Morris almost 7 years ago
- Target version changed from Arvados Future Sprints to 2018-08-01 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-08-01 Sprint to 2018-08-15 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-08-15 Sprint to 2018-09-05 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-09-05 Sprint to 2018-09-19 Sprint
Updated by Lucas Di Pentima over 6 years ago
Updated by Tom Clegg about 6 years ago
- Related to Bug #13772: Rerunning a container_request that has a failed child CR should restart the failed CR added