Feature #13773

closed

"Will fail" status for failing (but not yet failed) containers

Added by Peter Amstutz almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: 3.0
Release:
Release relationship: Auto

Description

It is useful to know when a container is going to fail but has not yet finished running.

Workflow developers / users want to know this so workflows can be resubmitted.

arvados-cwl-runner wants to use this to avoid reusing an arvados-cwl-runner container which has already decided to fail.

Proposed implementation

API

Add a runtime_status serialized hash attribute to the containers model on the API server, stored as an indexed jsonb column.
  • runtime_status can be updated when state∈{"Locked", "Running"}.
  • runtime_status is cleared if state changes from "Locked" to "Queued" (to avoid leaking status messages between different dispatch attempts).
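
The two rules above can be sketched as follows. This is only an illustration in Python, not the API server's code: the Container class and its method names are hypothetical, and merging new keys into the existing hash (rather than replacing it) is an assumption the proposal does not spell out.

```python
from dataclasses import dataclass, field

# States in which runtime_status may be updated, per the proposal.
UPDATABLE_STATES = {"Locked", "Running"}


@dataclass
class Container:
    """Hypothetical stand-in for the containers model."""
    state: str = "Queued"
    runtime_status: dict = field(default_factory=dict)

    def update_runtime_status(self, updates: dict) -> None:
        # Reject updates outside Locked/Running.
        if self.state not in UPDATABLE_STATES:
            raise ValueError(
                f"runtime_status cannot be updated in state {self.state!r}"
            )
        # Assumption: new keys are merged into the existing hash.
        self.runtime_status = {**self.runtime_status, **updates}

    def set_state(self, new_state: str) -> None:
        # Going back from Locked to Queued clears the status so messages
        # do not leak between dispatch attempts.
        if self.state == "Locked" and new_state == "Queued":
            self.runtime_status = {}
        self.state = new_state
```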

If a container with state=="Running" has an error key in its runtime_status then it must not be a candidate for reuse. (Note that the "Locked" state is deliberately omitted here because dispatch/setup errors are retryable.)
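
As a rough sketch of this rule alone (the function name is hypothetical, and the real reuse logic applies other criteria that are not shown here):

```python
def passes_runtime_status_check(state: str, runtime_status: dict) -> bool:
    """Return False only for the case the rule excludes from reuse.

    A Running container that has reported an error is not reusable.
    Locked containers are not excluded, because dispatch/setup errors
    are retryable.
    """
    return not (state == "Running" and "error" in runtime_status)
```

Under this sketch, a warning key on its own does not affect reuse.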

Documentation

Well known keys in runtime_status should be documented on the container schema page:
  • error: string, indicates the container will definitely fail, or has already failed
  • warning: string, indicates something unusual happened or is currently happening, but isn't considered fatal
  • activity: string, a message for the end user about what state the container is currently in
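
For illustration, the well-known keys could be described with a Python TypedDict. The type name and the message strings below are made up for the example; only the three key names and their string type come from the list above.

```python
from typing import TypedDict


class RuntimeStatus(TypedDict, total=False):
    """Well-known runtime_status keys; all are optional strings."""
    error: str     # container will definitely fail, or has already failed
    warning: str   # something unusual happened, not considered fatal
    activity: str  # message for the end user about the current state


# Illustrative value only; the message text is invented for this example.
example: RuntimeStatus = {
    "warning": "example warning message",
    "activity": "example activity message",
}
```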

arvados-cwl-runner

  • store the first fatal error (failed child, error in workflow definition) in error
  • (secondary goal) mention any additional errors ("first error (4 additional errors)"?)
  • (secondary goal) store jobsComplete / jobsWaiting / jobsFailed
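
A minimal sketch of how the first error and the count of additional errors might be combined into one message, following the "first error (4 additional errors)" wording suggested above. The function name is hypothetical and this is not arvados-cwl-runner code.

```python
from typing import List, Optional


def summarize_errors(errors: List[str]) -> Optional[str]:
    """Build the string to store under the error key, or None if no errors."""
    if not errors:
        return None
    first = errors[0]
    extra = len(errors) - 1
    if extra == 0:
        return first
    noun = "error" if extra == 1 else "errors"
    return f"{first} ({extra} additional {noun})"
```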

Workbench

If a running container has error or warning in its runtime_status, Workbench should flag it on the dashboard and other summary views with a color/label that distinguishes it from the normal "running" state (perhaps also showing the error/warning message in a tooltip).

Workbench should display any error or warning messages prominently in the detailed view.
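
The display rule could look roughly like the following. It is written in Python only for illustration; the function name and the label strings "failing" and "warning" are assumptions, since the proposal does not fix the wording or colors.

```python
from typing import Optional, Tuple


def display_label(state: str, runtime_status: dict) -> Tuple[str, Optional[str]]:
    """Return (label, tooltip) for a summary view.

    Only Running containers are flagged; error takes precedence
    over warning. Label names here are hypothetical.
    """
    if state == "Running":
        if "error" in runtime_status:
            return "failing", runtime_status["error"]
        if "warning" in runtime_status:
            return "warning", runtime_status["warning"]
    return state.lower(), None
```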

Additional ideas

These features are anticipated but they are not expected to be included in the initial implementation:
  • crunch-dispatch-slurm can update the activity field to indicate "in slurm queue"
  • crunch-run can update the activity field to indicate loading Docker image or uploading output
  • crunch-run or arv-mount can detect likely cache thrashing conditions and generate a warning
  • arvados-cwl-runner reports additional structured error details under errorDetails for Workbench to display

Files

  • runtime status error warning.png (78.6 KB), Lucas Di Pentima, 09/10/2018 07:53 PM
  • fail.cwl (537 Bytes), Peter Amstutz, 09/13/2018 07:03 PM

Subtasks: 1 (0 open, 1 closed)

  • Task #13843: Review 13773-will-fail-container-status (Resolved, Lucas Di Pentima, 09/06/2018)

Related issues

  • Related to Arvados - Bug #13772: Rerunning a container_request that has a failed child CR should restart the failed CR (New)