Idea #3407

closed

[Documentation] Pipeline development workflow

Added by Abram Connelly over 9 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Documentation
Target version: -
Start date: 07/29/2014
Due date:
Story points: -

Description

I am trying to develop common idioms or best practices for pipeline development so that the development cycle stays as short as possible by re-using jobs wherever that is appropriate.

Though common idioms are the goal, this note mostly has the following scenario in mind when discussing job re-use:

  • A multi-step pipeline is created, and each job is run with its input and output saved.
  • A bug is found in a job further down the pipeline.
  • The source files for the job are updated with the hope of fixing the bug.
  • The pipeline is re-submitted to see if the bug fix corrected the problem.

Ideally, the earlier jobs that were unaffected by the source file change would not need to be rerun. If the pipeline is not set up properly, however, the pipeline in the above example will be rerun in its entirety.
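To make the concern concrete, here is a minimal sketch of the re-use question. It is a toy model and not how Arvados actually decides re-use: the fingerprint fields, the function names (job_key, resolve, jobs_to_rerun), the script names and the commit labels are all assumptions made for illustration.

```python
import hashlib
import json


def job_key(repository, script, commit, parameters):
    """Fingerprint a job; equal fingerprints are treated as the same work."""
    payload = json.dumps(
        {"repository": repository, "script": script,
         "commit": commit, "parameters": parameters},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def resolve(heads, repository, script_version):
    """Map a branch name to its current commit; anything else is a literal commit."""
    return heads.get((repository, script_version), script_version)


def jobs_to_rerun(pipeline, heads, finished):
    """Names of jobs whose fingerprint is not among the finished ones."""
    rerun = []
    for name, job in pipeline.items():
        commit = resolve(heads, job["repository"], job["script_version"])
        key = job_key(job["repository"], job["script"], commit, job["parameters"])
        if key not in finished:
            rerun.append(name)
    return rerun


# Hypothetical two-step pipeline; both jobs follow 'master' of one repository.
pipeline = {
    "align": {"repository": "pipelines", "script": "align.py",
              "script_version": "master", "parameters": {"input": "reads"}},
    "call": {"repository": "pipelines", "script": "call.py",
             "script_version": "master", "parameters": {"input": "alignments"}},
}

heads_before = {("pipelines", "master"): "commit-1"}
finished = {
    job_key(j["repository"], j["script"], "commit-1", j["parameters"])
    for j in pipeline.values()
}

# A bug fix to call.py lands on master, so the branch points at a new commit.
heads_after = {("pipelines", "master"): "commit-2"}

print(jobs_to_rerun(pipeline, heads_before, finished))  # []
print(jobs_to_rerun(pipeline, heads_after, finished))   # ['align', 'call']
```

In this model the fix only touched call.py, yet 'align' is rerun as well, because both jobs resolve their version from the same branch.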

The purpose of this story is to set expectations for when jobs are re-used or re-run, and to describe how to use the job re-use capability of Arvados properly.

Three workflows have been proposed in discussions with others:

  1. Create all of your pipelines and jobs under one repository. During development, decorate each job with its own 'script_version' or git tag to make sure it doesn't get re-run when the master branch is updated. When development is finished, 'lock in' all jobs by giving them the final 'script_version' or git tag that represents a snapshot of the working state.
  2. Create all of your pipelines and jobs under one repository but separate jobs you want to keep isolated into their own branches so that work in one branch will not cause jobs in other branches to be slated for rerun.
  3. Create a separate git repository for isolated jobs.
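The three layouts can be sketched as plain data. This is an illustration only: the dictionaries are loosely modelled on the 'repository' and 'script_version' fields of a pipeline component, and the job names, tag names, branch names and the helper affected_jobs are hypothetical.

```python
def affected_jobs(pipeline, repository, ref):
    """Jobs that follow the given repository and ref, and so see a new commit there."""
    return [
        name for name, job in pipeline.items()
        if job["repository"] == repository and job["script_version"] == ref
    ]


# Option (1): one repository, each job pinned to its own tag.
option_1 = {
    "align": {"repository": "pipelines", "script_version": "align-v1"},
    "call": {"repository": "pipelines", "script_version": "call-v1"},
}

# Option (2): one repository, one branch per job.
option_2 = {
    "align": {"repository": "pipelines", "script_version": "align"},
    "call": {"repository": "pipelines", "script_version": "call"},
}

# Option (3): one repository per job.
option_3 = {
    "align": {"repository": "align", "script_version": "master"},
    "call": {"repository": "call", "script_version": "master"},
}

# Which jobs notice a new commit in the place where the 'call' fix is made?
print(affected_jobs(option_1, "pipelines", "master"))  # []
print(affected_jobs(option_2, "pipelines", "call"))    # ['call']
print(affected_jobs(option_3, "call", "master"))       # ['call']
```

Under option (1) nothing is rerun until the implementer moves a tag or edits the template by hand, which is where the extra bookkeeping comes from. Under options (2) and (3) the isolation comes from where the commit lands.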

Option (1) is what the existing functionality in Arvados is trying to address. The drawback I can see is that it puts too much cognitive load on the pipeline implementer, who has to create tags, record script versions and update the pipeline template at each step in the development process. It could also lead to confusion when jobs are re-used because of an older script revision even though the pipeline implementer has updated the underlying scripts or programs.

Option (2) keeps different jobs isolated by putting them in their own branches. This encourages a workflow whereby a pipeline implementer sets the 'script_version' to the latest snapshot and expects jobs to be re-run whenever source files are altered within that branch. An environment could be set up so that there are separate directories for each branch. Each job sits within its own branch, so the pipeline implementer knows that an alteration within one branch will not affect other jobs or branches. A potential pitfall of this workflow is a confusing initial setup, along with duplication of code if the branches are not cultivated with clarity in mind.

Option (3) opts for a completely separate repository per relevant job. This is similar in philosophy to option (2) but with the potential benefit of a simpler setup step and cleaner maintenance.

Options (2) and (3) both allow for pipeline 'cloning', but option (3) might be slightly easier to use because a branch name does not need to be included.

A potential pitfall of both options (2) and (3) arises when a job is used in multiple pipelines. Pipelines that are thought to be concluded might be slated for re-run when, for example, a job is altered with some other pipeline update in mind.
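A small sketch of this pitfall, under the same kind of simplified model: each job records a 'repository' and a 'script_version', and a pipeline is affected when one of its jobs follows the ref that received a commit. The pipeline names, job names, commit label and the helper pipelines_affected are hypothetical.

```python
def pipelines_affected(pipelines, repository, ref):
    """Map each pipeline to its jobs that follow the given repository and ref."""
    affected = {}
    for pipeline_name, jobs in pipelines.items():
        names = [
            name for name, job in jobs.items()
            if job["repository"] == repository and job["script_version"] == ref
        ]
        if names:
            affected[pipeline_name] = names
    return affected


# Two hypothetical pipelines that share the 'call' job (option (3) layout).
pipelines = {
    "concluded": {
        "align": {"repository": "align", "script_version": "master"},
        "call": {"repository": "call", "script_version": "master"},
    },
    "in_progress": {
        "filter": {"repository": "filter", "script_version": "master"},
        "call": {"repository": "call", "script_version": "master"},
    },
}

# A change made to 'call' for the in-progress pipeline also touches the concluded one.
print(pipelines_affected(pipelines, "call", "master"))
# {'concluded': ['call'], 'in_progress': ['call']}

# Pinning the concluded pipeline to a fixed version (the 'lock in' step of
# option (1)) takes it out of the blast radius.
pipelines["concluded"]["call"]["script_version"] = "commit-1"
print(pipelines_affected(pipelines, "call", "master"))
# {'in_progress': ['call']}
```

The second call suggests one way the options might be combined: follow a branch or separate repository while developing, then pin a fixed version once a pipeline is considered finished.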

There has been a proposal to import other repositories, say from GitHub or GitLab, for use in a crunch pipeline. In my mind, this encourages option (3) as the idiomatic use case.

My preference is for option (3) (one repository per job), as this seems like the most natural separation, but I don't have a good sense of what the 'proper' workflow is.


Related issues

Related to Arvados - Support #3401: [Documentation] Job Re-Use Kludge (Closed, 07/29/2014)
Related to Arvados - Idea #3511: [Documentation] Present an efficient pattern for developing a pipeline template with multiple crunch scripts (Closed)

#1

Updated by Abram Connelly over 9 years ago

  • Description updated

#2

Updated by Tom Clegg over 9 years ago

  • Description updated

#3

Updated by Tom Clegg over 9 years ago

  • Subject changed from Pipeline workflow to [Documentation] Pipeline development workflow
  • Story points set to 2.0

#4

Updated by Tom Clegg over 9 years ago

  • Target version set to 2014-08-27 Sprint

#5

Updated by Tim Pierce over 9 years ago

  • Category set to Documentation

#6

Updated by Tom Clegg over 9 years ago

  • Story points deleted (2.0)

#7

Updated by Peter Amstutz over 9 years ago

  • Target version deleted (2014-08-27 Sprint)

#8

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed