Feature #13773
closed"Will fail" status for failing (but not yet failed) containers
Description
It is useful to know when a container is going to fail, but hasn't completed yet.
Workflow developers / users want to know this so workflows can be resubmitted.
arvados-cwl-runner wants to use this to avoid reusing an arvados-cwl-runner container which has already decided to fail.
Proposed implementation¶
API¶
Add aruntime_status
serialized hash attribute to containers
model on API server, stored as an indexed jsonb column.
- runtime_status can be updated when state∈{"Locked", "Running"}.
- runtime_status is cleared if state changes from "Locked" to "Queued" (to avoid leaking status messages between different dispatch attempts).
If a container with state=="Running" has an error
key in its runtime_status
then it must not be a candidate for reuse. (Note "Locked" state is deliberately omitted here because dispatch/setup errors are retryable.)
Documentation¶
Well known keys inruntime_status
should be documented on the container schema page:
error
: string, indicates the container will definitely fail, or has already failedwarning
: string, indicates something unusual happened or is currently happening, but isn't considered fatalactivity
: string, a message for the end user about what state the container is currently in
arvados-cwl-runner¶
- store the first fatal error (failed child, error in workflow definition) in
error
- (secondary goal) mention any additional errors ("first error (4 additional errors)"?)
- (secondary goal) store jobsComplete / jobsWaiting / jobsFailed
Workbench¶
If a running container has error
or warning
in its runtime_status
, Workbench should flag it with a color/label to distinguish it from the normal "running" state (perhaps also showing the error/warning message in a tooltip) on the dashboard and other summary views.
Workbench should display any error
or warning
messages prominently in the detailed view.
Additional ideas¶
These features are anticipated but they are not expected to be included in the initial implementation:- crunch-dispatch-slurm can update the
activity
field to indicate "in slurm queue" - crunch-run can update the
activity
field to indicateloading Docker image
oruploading output
- crunch-run or arv-mount can detect likely cache thrashing conditions and generate a warning
- arvados-cwl-runner reports additional structured error details under
errorDetails
for Workbench to display
Files
Updated by Peter Amstutz over 6 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 6 years ago
- Subject changed from "Will fail" status to prevent reuse of failing (but not yet failed) container to "Will fail" status for failing (but not yet failed) containers
- Description updated (diff)
- Status changed from In Progress to New
Updated by Peter Amstutz over 6 years ago
Introduce a "status" field with some well defined fields.
Updated by Tom Morris over 6 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
- Story points set to 3.0
Updated by Tom Morris over 6 years ago
- Target version changed from Arvados Future Sprints to 2018-08-01 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Assigned To set to Lucas Di Pentima
Updated by Peter Amstutz over 6 years ago
The idea of having error/warning be a list was is to help communicate multiple errors (eg steps 2, 5 and 7 failed).
It should be possible to update runtime_status when in "Locked" state, because currently the locked state includes the container initialization (we'd like to communicate things like "current activity is loading Docker image".
Updated by Tom Clegg over 6 years ago
- Description updated (diff)
- choose an appropriate way to report the multiple-error condition to a user, and store that string in the "error" field (e.g., "first error", "last error", "first error (N more errors)")
- optionally store an array of individual errors (and any other error details) under a different key like errorDetails.
Updated description to allow updates when state=Locked, and clear when state=Queued.
Updated by Peter Amstutz over 6 years ago
Tom Clegg wrote:
Regarding multiple errors: the "error" or "warning" values should be a single human-readable summary of the error/warning condition. Rather than giving an array of values and letting Workbench guess which ones are most important to display, clients should
- choose an appropriate way to report the multiple-error condition to a user, and store that string in the "error" field (e.g., "first error", "last error", "first error (N more errors)")
- optionally store an array of individual errors (and any other error details) under a different key like errorDetails.
So, my thought was that if it is an array of errors, workbench can just present it as a bulleted list. if it is a single string, and there are multiple errors, then the client could format them, but then we need to define how workbench is supposed to handle the formatting. Maybe allow textile in that field?
Updated by Tom Clegg over 6 years ago
I still think the "error" value, which all presentation code is expected to handle, is most useful as a plain old string. This makes it usable in a terminal window, a tooltip, a program that wants to wrap an error ("baz workflow failed: $x"), etc. The reporting client is welcome to store additional structured information in other keys -- but it still has to come up with a string, too. (If it's hard for arvados-cwl-runner to summarize its own multiple-error conditions usefully, I don't think it will get any easier in Workbench...)
Updated by Peter Amstutz over 6 years ago
Tom Clegg wrote:
I still think the "error" value, which all presentation code is expected to handle, is most useful as a plain old string. This makes it usable in a terminal window, a tooltip, a program that wants to wrap an error ("baz workflow failed: $x"), etc. The reporting client is welcome to store additional structured information in other keys -- but it still has to come up with a string, too. (If it's hard for arvados-cwl-runner to summarize its own multiple-error conditions usefully, I don't think it will get any easier in Workbench...)
So assume preformatted plain text (ie respect all line breaks)? That's fine, we just need to be clear about it. One of the goals here is for arvados-cwl-runner to be able to surface errors which came from a failing step instead of making the user sift through logs. However right now that means returning the last 40 lines of the step's stdout/stderr (because we don't know exactly where the error message is.)
Updated by Peter Amstutz over 6 years ago
Maybe think about what how we want errors to be presented on workbench, and then work back from there? Separate "error" (one line summary) and "error_details" (longer, formatted) would make sense.
Updated by Lucas Di Pentima over 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-08-01 Sprint to 2018-08-15 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-08-15 Sprint to 2018-09-05 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-09-05 Sprint to 2018-09-19 Sprint
Updated by Lucas Di Pentima over 6 years ago
Updates at 4a466ccc5 - branch 13773-will-fail-container-status
Test run: https://ci.curoverse.com/job/developer-run-tests/876/
- Add
runtime_status
Hash attribute to containers on the API Server - Avoid container reuse when there's an
error
key onruntime_status
- Add documentation about
error
,errorDetails
,warning
&activity
keys - On
a-c-r
, updateruntime_status
on runner containers when a child fails, including the child's name & uuid on theerror
key and the last 40 log lines onerrorDetails
.- Add/update
(and N more)
string on theerror
key when more than 1 child fails.
- Add/update
- On workbench, show "Failing" and "Warning" labels on running containers with relevant
runtime_status
information. On the dashboard also show a tooltip with error/warning messages and on the detailed view show a panel, includingerrorDetails
in the error case. - Add tests on API Server &
arvados-cwl-runner
. Workbench tests pending.
Updated by Peter Amstutz over 6 years ago
If only certain errors get recorded, people are going to be confused and we'll be fielding support requests from users who don't know when to look in the logs tab. For example, the current implementation wouldn't show CWL syntax errors in runtime_status
.
In arvados-cwl-runner, how hard would it be to intercept logging.exception(), logging.error(), and logging.warning() and have that call runtime_status_error()
? I think this is mostly a matter of configuring the root logger with a custom output handler.
Updated by Peter Amstutz over 6 years ago
In workbench:
<div id="errorDetail" class="collapse"> <pre><%= wu.runtime_status[:errorDetail] %></pre> </div>
Does this sanitize :errorDetail (and other places where the :error message is rendered)?
Minor documentation nitpicking:
"error": "This container won't be successful because at least one step have already failed."
Should be "has already failed."
h2(#runtime_status). {% include 'container_runtime_status' %}
This doesn't need to be included from a separate file, the reason the other sections are that way is because they are included in both containers and container_requests, but this one is container specific.
Updated by Peter Amstutz over 6 years ago
Workbench presentation:
- I suggest using a similar layout to the workflow steps. This would be a panel where the title of the panel is the error message followed by a little down-pointing triangle. The body of the panel will be initially hidden, and clicking anywhere on the error message opens the panel body to display the full error.
- If there is a warning, it should always be displayed (not if/else with error).
- The value of "activity" isn't displayed on Workbench at all.
- The API server should validate that
error
,errorDetails
,warning
andactivity
are strings in runtime_status, if present. - Should we add "warningDetails"?
Nitpicking aside, this looks really good, I tried it out with a test case (a workflow with one failing and one passing step) and it worked exactly as expected.
Updated by Lucas Di Pentima over 6 years ago
I've written a log handler that intercepts log messages and updates runtime_status
as an error/warning/activity status. When the log string contains a '\n'
character, it uses it as a "kindDetail" field. The issue that I'm now having is on the error case: when a child container fails, the error log on the runner may arrive before the child running its runner's update status, so the errorDetail
that we got from the 40 lines of error logs now are lost because only the first error occurrence is saved.
I've attached an example screenshot to also show how errors and warnings are now displayed simultaneously. I didn't make it work this way at first because I supposed that if there's an error, a warning is superfluous.
Regarding the activity status, do you have something in mind as which UI element to be used on the detailed/dashboard views?
Updated by Peter Amstutz over 6 years ago
My idea was that you'd take the first line of the first error message to use as "error", and then all subsequent error logging would get appended to errorDetail so that everything noteworthy is being displayed.
Warnings should handled the same way.
Activity should only be the most recent status. I don't think it makes sense for activity to be the same as the "info" log level, it should probably be better thought out as its own feature (and maybe incorporate updating the "progress" field as well.)
Updated by Peter Amstutz over 6 years ago
Tom has pointed out that detailed error reporting is out of scope for the ticket, so disregard my comments, let's merge the minimal version of this feature.
Updated by Lucas Di Pentima over 6 years ago
Updates at 3c08b21fd
Test run: https://ci.curoverse.com/job/developer-run-tests/887/
To avoid scope creeping, the updates include:
- Intercept logging (warning & error) messages to save them as runtime status messages. This logging handler will populate the [error|warning]Detail field when the logged message is multi-line
- Update documentation as suggested
- Update a-c-r test so that it checks the logging handler does its task
- Enhance runtime status error/warning display on workbench's container detailed page
- Sanitize runtime status information when displaying it on workbench
- Add
runtime_status
data type validation on the api server (w/test)
Updated by Lucas Di Pentima over 6 years ago
Updates at 58f3f7ab5
Test run: https://ci.curoverse.com/job/developer-run-tests/889/
Fixed test that failed when running the entire sdk/cwl
suite.
Updated by Peter Amstutz over 6 years ago
Nitpick, "levelno" is an integer, and (ERROR, WARNING, INFO) are something like (40, 30, 20) so I think you could use >=
instead (in fact, there's a "CRITICAL" logging level which is levelno=50)
if record.levelno == logging.ERROR: kind = 'error' elif record.levelno == logging.WARNING: kind = 'warning' elif record.levelno == logging.INFO: kind = 'activity'
I tried running a workflow. I think there's a bug. I ran a test workflow and it never submitted a child workflow. The last update to "activity" does not correspond to the last INFO line of stderr. I'm suspicious that something in the new logging may have failed silently and left the workflow runner deadlocked.
2018-09-13T18:34:51.201134723Z cwltool INFO: /usr/local/bin/arvados-cwl-runner 1.2.0.20180907205721, arvados-python-client 1.2.0.20180824213733, cwltool 1.0.20180806194258 2018-09-13T18:34:51.357159071Z cwltool INFO: Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main' 2018-09-13T18:34:53.259160187Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0 2018-09-13T18:34:53.389896614Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0 2018-09-13T18:34:53.548907141Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0 2018-09-13T18:34:53.744135935Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0 2018-09-13T18:34:53.902239799Z cwltool INFO: [workflow workflow.json#main] start 2018-09-13T18:34:53.902474964Z cwltool INFO: [workflow workflow.json#main] starting step s1
"runtime_status": { "activity": "arvados.arv-run: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0" }
Updated by Peter Amstutz over 6 years ago
Attached is the test workflow I used.
Updated by Lucas Di Pentima over 6 years ago
Update at 6451f2e4a
Test run: https://ci.curoverse.com/job/developer-run-tests/892/
Updated logging level comparison as requested by Peter. Both ERROR & CRITICAL logging will go to error
runtime_status.
Updated by Peter Amstutz over 6 years ago
Lucas Di Pentima wrote:
Update at 6451f2e4a
Test run: https://ci.curoverse.com/job/developer-run-tests/892/Updated logging level comparison as requested by Peter. Both ERROR & CRITICAL logging will go to
error
runtime_status.
Thanks.
I was not on the latest branch when I ran the test workflow. On the latest branch, it worked as expected.
This LGTM.
Updated by Lucas Di Pentima over 6 years ago
Merged master
into the branch, resolved a migration conflict at 30d37841e
Running tests one last time before merging: https://ci.curoverse.com/job/developer-run-tests/893/
Updated by Lucas Di Pentima over 6 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|1010c9e519409f0eb36971df6c382fc7cc0e2732.
Updated by Tom Clegg almost 6 years ago
- Related to Bug #13772: Rerunning a container_request that has a failed child CR should restart the failed CR added