Feature #13773

"Will fail" status for failing (but not yet failed) containers

Added by Peter Amstutz about 1 year ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
09/06/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release:
Release relationship:
Auto

Description

It is useful to know when a container is going to fail, but hasn't completed yet.

Workflow developers / users want to know this so workflows can be resubmitted.

arvados-cwl-runner wants to use this to avoid reusing an arvados-cwl-runner container which has already decided to fail.

Proposed implementation

API

Add a runtime_status serialized hash attribute to containers model on API server, stored as an indexed jsonb column.
  • runtime_status can be updated when state∈{"Locked", "Running"}.
  • runtime_status is cleared if state changes from "Locked" to "Queued" (to avoid leaking status messages between different dispatch attempts).

If a container with state=="Running" has an error key in its runtime_status then it must not be a candidate for reuse. (Note "Locked" state is deliberately omitted here because dispatch/setup errors are retryable.)

Documentation

Well known keys in runtime_status should be documented on the container schema page:
  • error: string, indicates the container will definitely fail, or has already failed
  • warning: string, indicates something unusual happened or is currently happening, but isn't considered fatal
  • activity: string, a message for the end user about what state the container is currently in

arvados-cwl-runner

  • store the first fatal error (failed child, error in workflow definition) in error
  • (secondary goal) mention any additional errors ("first error (4 additional errors)"?)
  • (secondary goal) store jobsComplete / jobsWaiting / jobsFailed

Workbench

If a running container has error or warning in its runtime_status, Workbench should flag it with a color/label to distinguish it from the normal "running" state (perhaps also showing the error/warning message in a tooltip) on the dashboard and other summary views.

Workbench should display any error or warning messages prominently in the detailed view.

Additional ideas

These features are anticipated but they are not expected to be included in the initial implementation:
  • crunch-dispatch-slurm can update the activity field to indicate "in slurm queue"
  • crunch-run can update the activity field to indicate loading Docker image or uploading output
  • crunch-run or arv-mount can detect likely cache thrashing conditions and generate a warning
  • arvados-cwl-runner reports additional structured error details under errorDetails for Workbench to display
runtime status error warning.png (78.6 KB) runtime status error warning.png Lucas Di Pentima, 09/10/2018 07:53 PM
fail.cwl (537 Bytes) fail.cwl Peter Amstutz, 09/13/2018 07:03 PM

Subtasks

Task #13843: Review 13773-will-fail-container-statusResolvedLucas Di Pentima


Related issues

Related to Arvados - Bug #13772: Rerunning a container_request that has a failed child CR should restart the failed CRNew

Associated revisions

Revision 1010c9e5
Added by Lucas Di Pentima 11 months ago

Merge branch '13773-will-fail-container-status'
Closes #13773

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Peter Amstutz about 1 year ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz about 1 year ago

  • Subject changed from "Will fail" status to prevent reuse of failing (but not yet failed) container to "Will fail" status for failing (but not yet failed) containers
  • Description updated (diff)
  • Status changed from In Progress to New

#3 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#4 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#5 Updated by Peter Amstutz about 1 year ago

Introduce a "status" field with some well defined fields.

#6 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#7 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#8 Updated by Tom Morris about 1 year ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 3.0

#9 Updated by Tom Morris about 1 year ago

  • Target version changed from Arvados Future Sprints to 2018-08-01 Sprint

#10 Updated by Lucas Di Pentima about 1 year ago

  • Assigned To set to Lucas Di Pentima

#11 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#12 Updated by Peter Amstutz about 1 year ago

The idea of having error/warning be a list was is to help communicate multiple errors (eg steps 2, 5 and 7 failed).

It should be possible to update runtime_status when in "Locked" state, because currently the locked state includes the container initialization (we'd like to communicate things like "current activity is loading Docker image".

#13 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)
Regarding multiple errors: the "error" or "warning" values should be a single human-readable summary of the error/warning condition. Rather than giving an array of values and letting Workbench guess which ones are most important to display, clients should
  • choose an appropriate way to report the multiple-error condition to a user, and store that string in the "error" field (e.g., "first error", "last error", "first error (N more errors)")
  • optionally store an array of individual errors (and any other error details) under a different key like errorDetails.

Updated description to allow updates when state=Locked, and clear when state=Queued.

#14 Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote:

Regarding multiple errors: the "error" or "warning" values should be a single human-readable summary of the error/warning condition. Rather than giving an array of values and letting Workbench guess which ones are most important to display, clients should
  • choose an appropriate way to report the multiple-error condition to a user, and store that string in the "error" field (e.g., "first error", "last error", "first error (N more errors)")
  • optionally store an array of individual errors (and any other error details) under a different key like errorDetails.

So, my thought was that if it is an array of errors, workbench can just present it as a bulleted list. if it is a single string, and there are multiple errors, then the client could format them, but then we need to define how workbench is supposed to handle the formatting. Maybe allow textile in that field?

#15 Updated by Tom Clegg about 1 year ago

I still think the "error" value, which all presentation code is expected to handle, is most useful as a plain old string. This makes it usable in a terminal window, a tooltip, a program that wants to wrap an error ("baz workflow failed: $x"), etc. The reporting client is welcome to store additional structured information in other keys -- but it still has to come up with a string, too. (If it's hard for arvados-cwl-runner to summarize its own multiple-error conditions usefully, I don't think it will get any easier in Workbench...)

#16 Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote:

I still think the "error" value, which all presentation code is expected to handle, is most useful as a plain old string. This makes it usable in a terminal window, a tooltip, a program that wants to wrap an error ("baz workflow failed: $x"), etc. The reporting client is welcome to store additional structured information in other keys -- but it still has to come up with a string, too. (If it's hard for arvados-cwl-runner to summarize its own multiple-error conditions usefully, I don't think it will get any easier in Workbench...)

So assume preformatted plain text (ie respect all line breaks)? That's fine, we just need to be clear about it. One of the goals here is for arvados-cwl-runner to be able to surface errors which came from a failing step instead of making the user sift through logs. However right now that means returning the last 40 lines of the step's stdout/stderr (because we don't know exactly where the error message is.)

#17 Updated by Peter Amstutz about 1 year ago

Maybe think about what how we want errors to be presented on workbench, and then work back from there? Separate "error" (one line summary) and "error_details" (longer, formatted) would make sense.

#18 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#19 Updated by Tom Clegg about 1 year ago

  • Description updated (diff)

#20 Updated by Lucas Di Pentima about 1 year ago

  • Status changed from New to In Progress

#21 Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2018-08-01 Sprint to 2018-08-15 Sprint

#22 Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2018-08-15 Sprint to 2018-09-05 Sprint

#23 Updated by Lucas Di Pentima 12 months ago

  • Target version changed from 2018-09-05 Sprint to 2018-09-19 Sprint

#24 Updated by Lucas Di Pentima 12 months ago

Updates at 4a466ccc5 - branch 13773-will-fail-container-status
Test run: https://ci.curoverse.com/job/developer-run-tests/876/

  • Add runtime_status Hash attribute to containers on the API Server
  • Avoid container reuse when there's an error key on runtime_status
  • Add documentation about error, errorDetails, warning & activity keys
  • On a-c-r, update runtime_status on runner containers when a child fails, including the child's name & uuid on the error key and the last 40 log lines on errorDetails.
    • Add/update (and N more) string on the error key when more than 1 child fails.
  • On workbench, show "Failing" and "Warning" labels on running containers with relevant runtime_status information. On the dashboard also show a tooltip with error/warning messages and on the detailed view show a panel, including errorDetails in the error case.
  • Add tests on API Server & arvados-cwl-runner. Workbench tests pending.

#25 Updated by Peter Amstutz 12 months ago

If only certain errors get recorded, people are going to be confused and we'll be fielding support requests from users who don't know when to look in the logs tab. For example, the current implementation wouldn't show CWL syntax errors in runtime_status.

In arvados-cwl-runner, how hard would it be to intercept logging.exception(), logging.error(), and logging.warning() and have that call runtime_status_error()? I think this is mostly a matter of configuring the root logger with a custom output handler.

#26 Updated by Peter Amstutz 12 months ago

In workbench:

        <div id="errorDetail" class="collapse">
          <pre><%= wu.runtime_status[:errorDetail] %></pre>
        </div>

Does this sanitize :errorDetail (and other places where the :error message is rendered)?

Minor documentation nitpicking:

"error": "This container won't be successful because at least one step have already failed."

Should be "has already failed."

h2(#runtime_status). {% include 'container_runtime_status' %}

This doesn't need to be included from a separate file, the reason the other sections are that way is because they are included in both containers and container_requests, but this one is container specific.

#27 Updated by Peter Amstutz 12 months ago

Workbench presentation:

  • I suggest using a similar layout to the workflow steps. This would be a panel where the title of the panel is the error message followed by a little down-pointing triangle. The body of the panel will be initially hidden, and clicking anywhere on the error message opens the panel body to display the full error.
  • If there is a warning, it should always be displayed (not if/else with error).
  • The value of "activity" isn't displayed on Workbench at all.
A couple additional points:
  • The API server should validate that error, errorDetails, warning and activity are strings in runtime_status, if present.
  • Should we add "warningDetails"?

Nitpicking aside, this looks really good, I tried it out with a test case (a workflow with one failing and one passing step) and it worked exactly as expected.

#28 Updated by Lucas Di Pentima 11 months ago

I've written a log handler that intercepts log messages and updates runtime_status as an error/warning/activity status. When the log string contains a '\n' character, it uses it as a "kindDetail" field. The issue that I'm now having is on the error case: when a child container fails, the error log on the runner may arrive before the child running its runner's update status, so the errorDetail that we got from the 40 lines of error logs now are lost because only the first error occurrence is saved.

I've attached an example screenshot to also show how errors and warnings are now displayed simultaneously. I didn't make it work this way at first because I supposed that if there's an error, a warning is superfluous.

Regarding the activity status, do you have something in mind as which UI element to be used on the detailed/dashboard views?

#29 Updated by Peter Amstutz 11 months ago

My idea was that you'd take the first line of the first error message to use as "error", and then all subsequent error logging would get appended to errorDetail so that everything noteworthy is being displayed.

Warnings should handled the same way.

Activity should only be the most recent status. I don't think it makes sense for activity to be the same as the "info" log level, it should probably be better thought out as its own feature (and maybe incorporate updating the "progress" field as well.)

#30 Updated by Peter Amstutz 11 months ago

Tom has pointed out that detailed error reporting is out of scope for the ticket, so disregard my comments, let's merge the minimal version of this feature.

#31 Updated by Lucas Di Pentima 11 months ago

Updates at 3c08b21fd
Test run: https://ci.curoverse.com/job/developer-run-tests/887/

To avoid scope creeping, the updates include:

  • Intercept logging (warning & error) messages to save them as runtime status messages. This logging handler will populate the [error|warning]Detail field when the logged message is multi-line
  • Update documentation as suggested
  • Update a-c-r test so that it checks the logging handler does its task
  • Enhance runtime status error/warning display on workbench's container detailed page
  • Sanitize runtime status information when displaying it on workbench
  • Add runtime_status data type validation on the api server (w/test)

#32 Updated by Lucas Di Pentima 11 months ago

Updates at 58f3f7ab5
Test run: https://ci.curoverse.com/job/developer-run-tests/889/

Fixed test that failed when running the entire sdk/cwl suite.

#33 Updated by Peter Amstutz 11 months ago

Nitpick, "levelno" is an integer, and (ERROR, WARNING, INFO) are something like (40, 30, 20) so I think you could use >= instead (in fact, there's a "CRITICAL" logging level which is levelno=50)

        if record.levelno == logging.ERROR:
            kind = 'error'
        elif record.levelno == logging.WARNING:
            kind = 'warning'
        elif record.levelno == logging.INFO:
            kind = 'activity'

I tried running a workflow. I think there's a bug. I ran a test workflow and it never submitted a child workflow. The last update to "activity" does not correspond to the last INFO line of stderr. I'm suspicious that something in the new logging may have failed silently and left the workflow runner deadlocked.

2018-09-13T18:34:51.201134723Z cwltool INFO: /usr/local/bin/arvados-cwl-runner 1.2.0.20180907205721, arvados-python-client 1.2.0.20180824213733, cwltool 1.0.20180806194258
2018-09-13T18:34:51.357159071Z cwltool INFO: Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
2018-09-13T18:34:53.259160187Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
2018-09-13T18:34:53.389896614Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
2018-09-13T18:34:53.548907141Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
2018-09-13T18:34:53.744135935Z arvados.arv-run INFO: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
2018-09-13T18:34:53.902239799Z cwltool INFO: [workflow workflow.json#main] start
2018-09-13T18:34:53.902474964Z cwltool INFO: [workflow workflow.json#main] starting step s1
  "runtime_status": {
    "activity": "arvados.arv-run: Using empty collection d41d8cd98f00b204e9800998ecf8427e+0" 
  }

#34 Updated by Peter Amstutz 11 months ago

Attached is the test workflow I used.

#35 Updated by Lucas Di Pentima 11 months ago

Update at 6451f2e4a
Test run: https://ci.curoverse.com/job/developer-run-tests/892/

Updated logging level comparison as requested by Peter. Both ERROR & CRITICAL logging will go to error runtime_status.

#36 Updated by Peter Amstutz 11 months ago

Lucas Di Pentima wrote:

Update at 6451f2e4a
Test run: https://ci.curoverse.com/job/developer-run-tests/892/

Updated logging level comparison as requested by Peter. Both ERROR & CRITICAL logging will go to error runtime_status.

Thanks.

I was not on the latest branch when I ran the test workflow. On the latest branch, it worked as expected.

This LGTM.

#37 Updated by Lucas Di Pentima 11 months ago

Merged master into the branch, resolved a migration conflict at 30d37841e
Running tests one last time before merging: https://ci.curoverse.com/job/developer-run-tests/893/

#38 Updated by Lucas Di Pentima 11 months ago

  • Status changed from In Progress to Resolved

#39 Updated by Tom Morris 9 months ago

  • Release set to 14

#40 Updated by Tom Clegg 5 months ago

  • Related to Bug #13772: Rerunning a container_request that has a failed child CR should restart the failed CR added

Also available in: Atom PDF