Feature #4302

[Crunch] Pipelines should not fail immediately after one job failure, but continue running as much as possible

Added by Bryan Cosca about 4 years ago. Updated about 4 years ago.

The current pipeline running model is very linear:
Given A > B > C > D (the output of A is needed to run B, etc.), if B fails, C and D do not get run and the pipeline instance fails.

Let's say A > B > C > D and B > E > F. If C fails, D does not get run and the entire pipeline instance fails. But what if you want to see whether E and F complete? With the current model, E does get run but its output does not get saved (for example: qr1hi-8i9sb-56vgstlp2wk56vn). I would love to know whether F completes before I go and edit the template and look into C and D, but I cannot, because E's output does not get fed into F.

Let's say there are more of these branches. If B branches out to 20 other jobs and one of those branches fails, the other 19 are affected, which wastes a ton of time. The bioinformatician has to edit the pipeline template, remove the failed jobs, and rerun the 19 other branches. If one of those branches fails, then it's more editing, and so on. A ton of time could be saved if the pipeline ran until all the branches finished (failed or succeeded) and the editing were done afterwards. Let's say 10 of those branches actually complete: then you have saved 10 edit processes and avoided sitting at your computer 10x as long. It would be easier to just wait and edit the template once after all jobs are complete.

Also, the scenario where the bioinformatician runs a pipeline, walks away for a couple of hours, and comes back to find that nothing has been output would be frustrating, because he would have to rerun the pipeline and then do nothing but wait. He could have been analyzing the other branches in the pipeline and doing something useful, rather than waiting for his pipeline to finish.
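
To make the requested behavior concrete, here is a minimal sketch in Python of a scheduler that keeps running every job whose inputs are still available and skips only the jobs downstream of a failure. It is not how Crunch or arv-run-pipeline-instance is implemented. The names run_pipeline, run_job and keep_going are hypothetical, and the sketch assumes that every job appears as a key in the dependency map and that jobs run one at a time.

```python
from collections import deque


def run_pipeline(deps, run_job, keep_going=True):
    """Run jobs in dependency order and report what happened to each.

    deps maps each job name to the list of jobs whose output it needs
    (assumption: every job, including those with no inputs, is a key).
    run_job(name) is a hypothetical callback returning True on success.
    With keep_going=False the run halts after the first failure.
    """
    # Invert the dependency map so we can find the jobs fed by each job.
    children = {job: [] for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            children[parent].append(job)

    waiting_on = {job: set(parents) for job, parents in deps.items()}
    ready = deque(job for job, parents in waiting_on.items() if not parents)
    status = {}
    halted = False

    while ready:
        job = ready.popleft()
        if halted:
            status[job] = "not_run"
        elif any(status[parent] != "success" for parent in deps[job]):
            # An input is missing, so only this branch is abandoned.
            status[job] = "skipped"
        else:
            succeeded = run_job(job)
            status[job] = "success" if succeeded else "failed"
            if not succeeded and not keep_going:
                halted = True
        # Release the jobs that were waiting on this one.
        for child in children[job]:
            waiting_on[child].discard(job)
            if not waiting_on[child]:
                ready.append(child)
    return status


# The example from this ticket: A > B > C > D and B > E > F, with C failing.
deps = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["B"], "F": ["E"]}
result = run_pipeline(deps, lambda job: job != "C")
```

In this sketch C is reported as failed and D as skipped, while E and F still run to completion, which is the outcome asked for above. Passing keep_going=False instead stops scheduling after C fails, so D, E and F are reported as not_run.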


#1 Updated by Brett Smith about 4 years ago

  • Subject changed from Partial Failure for pipelines to [Crunch] Pipelines should not fail immediately after one job failure, but continue running as much as possible
  • Category set to Crunch
  • Target version set to Arvados Future Sprints

#2 Updated by Tom Clegg about 4 years ago

arv-run-pipeline-instance has (had?) this option, but there has never been a way in Workbench to specify whether you want it. (Sometimes you really do want to halt, and stop wasting resources, as soon as something doesn't work as expected. a-r-p-i mimics the "make" / "make -k" convention, i.e., the default is to stop as soon as one thing fails.)
