Feature #13773
Updated by Peter Amstutz over 6 years ago
It is useful to know when a container is going to fail, but hasn't completed yet. Workflow developers / users want to know this so workflows can be resubmitted. arvados-cwl-runner wants to use this to avoid reusing an arvados-cwl-runner container which has already decided to fail.

Need to:

* Decide how it should be represented on the container record; see [[Container status / outcome reporting]]
* Add required field(s) to database and API models
* Make arvados-cwl-runner set "will fail" status on the container when a workflow step fails.
* API server should take "will fail" status into account when searching for containers to reuse (via "filters" in arvados-cwl-runner's container request)
* In Workbench, display "will fail" status differently from "running"

h2. Proposed design

Add a @status@ field to @containers@ model on API server, as an indexed jsonb field. This field can be updated in all non-final states (Queued/Locked/Running).

Document well known subproperties under @status@:

* @error@: list of strings, indicates the container will definitely fail, or has already failed
* @warning@: list of strings, indicates something unusual happened or is currently happening, but isn't considered fatal
* @activity@: string, a message for the end user about what state the container is currently in

Update arvados-cwl-runner to put error messages (such as failed jobs, or errors in the CWL) into @error@

arvados-cwl-runner should report jobsComplete / jobsWaiting / jobsFailed

If a container has an @error@ field in its @status@ then it must not be a candidate for reuse.

If a container has @error@ or @warning@ in its @status@, Workbench should display it differently.

Workbench should display @error@ or @warning@ in the work unit display.

h3. Additional ideas

Crunch-dispatch-slurm can update the @activity@ field to indicate "in slurm queue"

Crunch-run can update the @activity@ field to indicate @loading Docker image@ or @uploading output@

Crunch-run or arv-mount can detect likely cache thrashing conditions and generate a warning
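
As a rough illustration of the proposed design, the sketch below models a container record as a plain Python dict and shows how the @status@ subproperties (@error@, @warning@, @activity@) might be updated and consulted. It is not the API server implementation: the helper names @update_status@, @is_reuse_candidate@ and @display_class@, and the display labels they return, are hypothetical and exist only for this example. Only the field names and the Queued/Locked/Running rule come from the proposal.

```python
from typing import Any, Dict, Optional

# From the proposal: status may be updated only in non-final states.
NON_FINAL_STATES = {"Queued", "Locked", "Running"}


def update_status(container: Dict[str, Any],
                  error: Optional[str] = None,
                  warning: Optional[str] = None,
                  activity: Optional[str] = None) -> None:
    """Hypothetical helper: record a status update on a container dict."""
    if container["state"] not in NON_FINAL_STATES:
        raise ValueError("status can only be updated in non-final states")
    status = container.setdefault("status", {})
    if error is not None:
        # error: list of strings, the container will fail or has failed
        status.setdefault("error", []).append(error)
    if warning is not None:
        # warning: list of strings, unusual but not fatal
        status.setdefault("warning", []).append(warning)
    if activity is not None:
        # activity: a single string describing the current state
        status["activity"] = activity


def is_reuse_candidate(container: Dict[str, Any]) -> bool:
    """A container with an error field in its status must not be reused."""
    return "error" not in container.get("status", {})


def display_class(container: Dict[str, Any]) -> str:
    """Hypothetical labels a UI could use to render the container differently."""
    status = container.get("status", {})
    if "error" in status:
        return "will-fail"
    if "warning" in status:
        return "warning"
    return "normal"
```

In a real deployment the reuse check would presumably be expressed as a filter on the indexed jsonb column rather than evaluated in client code; the dict-based version here only demonstrates the intended rule.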