
Idea #3407

Updated by Abram Connelly almost 10 years ago

I am trying to develop common idioms or best practices for pipeline development, to make sure the development cycle stays short by re-using jobs where possible.

Though common idioms are the goal, this note mostly has the following scenario in mind when discussing job re-use:

 - A multi-step pipeline is created, and each job is run with its input and output saved. 
 - A bug is found in a job further down the pipeline. 
 - The job's source files are updated in the hope of fixing the bug. 
 - The pipeline is re-submitted to see if the fix corrected the problem. 

 Ideally, the earlier jobs that were unaffected by the source change would not need to be rerun. If the pipeline is not set up properly, every job in the example above will be rerun from scratch. 
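In Arvados terms, the re-use decision hinges on the 'script_version' recorded for each component of the pipeline template. A minimal sketch of one component, with hypothetical names and values (field names follow the pipeline template format as I understand it):

```json
{
  "components": {
    "normalize": {
      "repository": "my-repo",
      "script": "normalize.py",
      "script_version": "normalize-v1",
      "script_parameters": {
        "input": { "required": true }
      }
    }
  }
}
```

If 'script_version' names a tag or commit, the component's code is frozen and the job stays eligible for re-use; if it names a branch, every push to that branch can invalidate it.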

 The purpose of this story is to set expectations about when jobs can be re-used or will be re-run, and how to properly use the job re-use capability of Arvados. 

 Three workflows have been proposed in discussions with others: 

   1) Create all of your pipelines and jobs under one repository. During development, decorate each job with its own 'script_version' or git tag to make sure it does not get re-run when the master branch is updated. When development is finished, 'lock in' all jobs by giving them a final 'script_version' or git tag that represents a snapshot of the working state. 
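The 'lock in' step in option (1) can be sketched with plain git. The repository contents, tag name, and script name here are hypothetical; the point is that a tag resolves to an immutable commit SHA, which is what 'script_version' should ultimately record:

```shell
# Sketch of option (1): tag a known-good state so the job is not
# invalidated by later commits to master. Names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
echo 'print("normalized")' > normalize.py
git add normalize.py
git commit -qm "working normalize step"
git tag normalize-v1                    # 'lock in' this job's working state
git rev-parse 'normalize-v1^{commit}'   # the SHA to record as script_version
```

The printed SHA is what survives even if master moves on afterwards.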

   2) Create all of your pipelines and jobs under one repository, but separate jobs you want to keep isolated into their own branches, so that work on one branch will not cause jobs on other branches to be slated for a rerun. 

   3) Create a separate git repository for each isolated job. 


 Option (1) is what the existing functionality in Arvados is designed to support. The drawback I see is that it puts too much cognitive load on the pipeline implementer, who must create tags, record script versions, and update the pipeline template at each step of the development process. It can also cause confusion when jobs are re-used because of an older script revision even though the implementer has since updated the underlying scripts or programs. 

 Option (2) keeps different jobs isolated by putting each in its own branch. This encourages a workflow in which the pipeline implementer sets 'script_version' to the latest snapshot and expects jobs to be re-run whenever source files change within that branch. An environment could be set up with a separate directory for each branch. Since each job sits in its own branch, the implementer knows that changes to one branch will not affect jobs in other branches. A potential pitfall of this workflow is the confusing initial setup and the duplication of code that can occur if the branches are not cultivated with clarity in mind. 
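The isolation property of option (2) can be demonstrated with plain git. The branch and file names below are hypothetical; the point is that a commit on one job's branch leaves the tip of every other job's branch untouched, so jobs pinned to those branches stay re-usable:

```shell
# Sketch of option (2): one branch per job. A fix on job/align does
# not move job/report, so the report job is not slated for a rerun.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
echo base > common.txt
git add common.txt
git commit -qm "shared base"
base=$(git rev-parse HEAD)
git branch job/align
git branch job/report
git checkout -q job/align
echo 'fixed' > align.py
git add align.py
git commit -qm "fix the align step only"
git rev-parse job/align    # moved by the fix
git rev-parse job/report   # still at the shared base
```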

 Option (3) opts for a completely separate repository per relevant job. This is similar in philosophy to option (2), but with the potential benefits of a simpler setup step and cleaner maintenance. 

 Options (2) and (3) both allow pipeline 'cloning', but option (3) might be slightly easier to use, since a branch name does not need to be included. 

 A potential pitfall of both options (2) and (3) is a job that is used in multiple pipelines. Pipelines thought to be concluded may be slated for a re-run when, for example, the shared job is altered with some other pipeline's update in mind. 

 There has been a proposal to import external repositories, say from GitHub or GitLab, for use in a crunch pipeline. In my mind, this encourages option (3) as the idiomatic use case. 

 My preference is option (3) (one repository per job), as it seems like the most natural separation, but I don't have a good sense of what the 'proper' workflow is. 
 

 There is a tension between the expectation that jobs be re-run when required and re-used when possible. Arvados takes the correct stance that if the branch named by a job's 'script_version' changes, the job should be re-run, regardless of which actual script the pipeline points to. 
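That branch-versus-commit distinction can be sketched with plain git (file and commit names hypothetical): a branch name is a moving pointer, so pinning 'script_version' to it invites re-runs, while pinning to the commit SHA it resolved to does not:

```shell
# A branch name moves with each commit; a recorded SHA does not.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
echo v1 > job.py
git add job.py
git commit -qm "v1"
pinned=$(git rev-parse HEAD)    # record an immutable script_version
echo v2 > job.py
git commit -qam "v2"            # the branch tip moves on
git rev-parse HEAD              # now differs from $pinned
git rev-parse "$pinned"         # still resolves to the same snapshot
```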
