Feature #13773

Updated by Tom Clegg almost 6 years ago

It is useful to know when a container is going to fail, but hasn't completed yet. 

 Workflow developers / users want to know this so workflows can be resubmitted. 

 arvados-cwl-runner wants to use this to avoid reusing an arvados-cwl-runner container which has already decided to fail. 

 Need to: 

 * Decide how it should be represented on the container record; see [[Container status / outcome reporting]] 
 * Add required field(s) to database and API models 
 * Make arvados-cwl-runner set "will fail" status on the container when a workflow step fails. 
 * API server should take "will fail" status into account when searching for containers to reuse (via "filters" in arvados-cwl-runner's container request) 
 * In Workbench, display "will fail" status differently from "running" 

 h2. Proposed implementation design 

 h3. API 

 Add a @runtime_status@ serialized hash attribute to the @containers@ model on the API server, stored as an indexed jsonb column. This field can be updated when @state=="Running"@.

 If a container with @state=="Running"@ has an @error@ key in its @runtime_status@ then it must not be a candidate for reuse.
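
 The reuse rule above can be sketched as a small predicate. This is only an illustration, not the API server's implementation: the function name @is_reuse_candidate@ and the plain-dict container record are assumptions, and all other reuse criteria are left out.

```python
def is_reuse_candidate(container: dict) -> bool:
    """Apply only the rule described above.

    `container` is a hypothetical plain-dict view of a container record
    with "state" and "runtime_status" keys. Real reuse decisions involve
    other criteria that this sketch ignores.
    """
    runtime_status = container.get("runtime_status") or {}
    # A running container that has already reported an error will fail,
    # so it must not be offered for reuse.
    if container.get("state") == "Running" and "error" in runtime_status:
        return False
    return True
```

 For example, a record with @state=="Running"@ and an empty @runtime_status@ passes the check, while the same record with an @error@ key does not.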

 h3. Documentation

 Well known keys in @runtime_status@ should be documented on the container schema page: 
 * @error@: string, indicates the container will definitely fail, or has already failed 
 * @warning@: string, indicates something unusual happened or is currently happening, but isn't considered fatal 
 * @activity@: string, a message for the end user about what state the container is currently in 
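
 As an illustration of the keys listed above, here is a hypothetical @runtime_status@ value and a small checker. It assumes each well-known key holds a single string. The helper name @check_runtime_status@ and the warning text are made up for this sketch.

```python
WELL_KNOWN_KEYS = ("error", "warning", "activity")


def check_runtime_status(runtime_status: dict) -> list:
    """Return a list of problems found in the well-known keys.

    Assumption for this sketch: every well-known key, when present,
    holds a single string. Other keys are ignored.
    """
    problems = []
    for key in WELL_KNOWN_KEYS:
        if key in runtime_status and not isinstance(runtime_status[key], str):
            problems.append(f"{key} should be a string")
    return problems


# Hypothetical example value; the warning text is invented for illustration.
example_status = {
    "warning": "something unusual happened",
    "activity": "uploading output",
}
```

 Calling @check_runtime_status(example_status)@ returns an empty list, since both values are strings.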

 h3. Update arvados-cwl-runner 

 * store the first fatal error (failed child, or error in the CWL) into @error@ 
 * mention any additional errors ("first error (4 additional errors)"?) 
 * report jobsComplete / jobsWaiting / jobsFailed 
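
 One possible way to build the @error@ message suggested above ("first error (4 additional errors)") is sketched here. The function name @summarize_errors@ and the exact wording are assumptions, not existing arvados-cwl-runner code.

```python
from typing import List, Optional


def summarize_errors(errors: List[str]) -> Optional[str]:
    """Build a single message from errors in the order they occurred.

    Returns None when there are no errors, the first error alone when
    there is exactly one, and otherwise the first error followed by a
    count of the remaining ones.
    """
    if not errors:
        return None
    first = errors[0]
    extra = len(errors) - 1
    if extra == 0:
        return first
    noun = "error" if extra == 1 else "errors"
    return f"{first} ({extra} additional {noun})"
```

 Keeping only the first error in full keeps the message short enough for summary views, while the count still shows that more went wrong.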

 h3. Workbench

 If a running container has @error@ or @warning@ in its @runtime_status@, Workbench should flag it with a color/label to distinguish it from the normal "running" state on the dashboard and other summary views, perhaps also showing the error/warning message in a tooltip.

 Workbench should display any @error@ or @warning@ messages prominently in the detailed view.
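
 The display rule for summary views could look roughly like the sketch below. It is written in Python purely for illustration and is not Workbench code. The label names "Failing" and "Warning" and the function @summary_label@ are assumptions.

```python
from typing import Optional, Tuple


def summary_label(container: dict) -> Tuple[str, Optional[str]]:
    """Return (label, tooltip) for a summary view.

    `container` is a hypothetical plain-dict record with a "state" key
    and an optional "runtime_status" key. The labels "Failing" and
    "Warning" are placeholders invented for this sketch.
    """
    state = container["state"]
    runtime_status = container.get("runtime_status") or {}
    if state == "Running":
        # An error takes precedence over a warning.
        if "error" in runtime_status:
            return "Failing", runtime_status["error"]
        if "warning" in runtime_status:
            return "Warning", runtime_status["warning"]
    # Any other case keeps the plain state label and no tooltip.
    return state, None
```

 Checking @error@ before @warning@ means a container that has reported both is shown in the more severe style.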

 h3. Additional ideas 

 These features are anticipated but they are _not_ expected to be included in the initial implementation: 
 * crunch-dispatch-slurm can update the @activity@ field to indicate "in slurm queue" 
 * crunch-run can update the @activity@ field to indicate @loading Docker image@ or @uploading output@ 
 * crunch-run or arv-mount can detect likely cache thrashing conditions and generate a warning 
