Git strategy for pipeline development

The following scenario is common:
  • You have a project that involves one or two pipelines
  • Each pipeline has many components
  • These pipelines and components make use of a common code base

Example:

Crunch scripts in repo:

crunch_scripts/align
crunch_scripts/call
crunch_scripts/compare

Pipeline A:

align(1)    align(2)
   |           |
call(1)     call(2)
   |____   ____|
        | |
      compare

Pipeline B:

align(1)    align(2)    align(3)
   |           |           |
call(1)     call(2)     call(3)
   |_________  |  _________|
             | | |
            compare

While developing the code you can expect to have moments like these:
  • Find a bug in compare that was making it fail when given 3 inputs.
      • Fix the code in commit "C", push, and re-run. (Note: results from previous runs that succeeded are still valid.)
  • Find a bug in compare that was making it produce incorrect output when given 3 inputs.
      • Fix the code in commit "E", push, and re-run. Update pipeline template B to prevent the broken jobs (the ones that are marked "success" but produced incorrect outputs) from being re-used in future pipeline runs. (Note: results from previous jobs from pipeline A are still OK.)

A strategy

The master branch has the latest stable version of everything.

For each component of each template, tag the oldest acceptable version.

tags →  pipelineA-compare   pipelineB-compare
                        ↓                   ↓
commits →               A----B----C----D----E----F
                                                 ↑
branches →                                       master
Tell your pipeline (or other job creation script) to
  • re-use existing jobs as long as they ran a version at or after the pipelineB-compare tag (the oldest acceptable version)
  • use "master" if no existing job is suitable

Or, you might use tags like
  • "compare" -- just use this tag for all pipelines that use the "compare" script. When you find a bug that produced bad output, just move the tag, and all pipelines will stop using the buggy code.
  • "compare-3way-bugfix" -- tag each bugfix, and add them to the pipelines where the bug could be a problem. Of course this presumes you'd rather trust yourself to keep track of which pipelines need which bugfixes than waste resources re-generating perfectly good outputs.
  • "ok" -- tag the whole repo, and use this as the earliest acceptable revision for all jobs/components. This is safer: if you fix library code in file C in order to fix job A, and forget that script B also uses code from file C, everything is fine because all jobs that used anything in the old repo will be ineligible for reuse.
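The tagging schemes above can be tried out in a scratch repository. The following is a minimal sketch rather than a prescribed workflow: it assumes plain lightweight git tags, builds a throwaway history A..F like the one in the diagram, tags A and E, and then moves a shared "compare" tag with `git tag -f`. The file name `compare` and the commit identity are placeholders.

```shell
#!/bin/sh
# Sketch: tag the oldest acceptable commit per pipeline, then move a shared tag.
# Everything happens in a throwaway repository; nothing is pushed anywhere.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master     # make sure the branch is called master
git config user.email "dev@example.com"     # placeholder identity for the demo
git config user.name "Demo"

# Build the history A----B----C----D----E----F
for c in A B C D E F; do
    echo "$c" > compare                     # placeholder file standing in for the script
    git add compare
    git commit -q -m "$c"
    case "$c" in
        A) git tag pipelineA-compare ;;     # oldest version pipeline A accepts
        E) git tag pipelineB-compare ;;     # oldest version pipeline B accepts
    esac
done

# Alternative scheme: one shared "compare" tag, moved when a bad-output bug is fixed.
git tag compare pipelineA-compare
git tag -f compare pipelineB-compare >/dev/null

echo "compare now points at commit: $(git log -1 --format=%s compare)"
```

In a repository with a remote, a moved tag would also need to be force-pushed (for example `git push --force origin refs/tags/compare`) before anything that reads the remote sees the change.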

The way you specify a range of acceptable revisions is a bit weird, but here it is:

arv job create --job '{
 "repository":"username/reponame",
 "script":"compare",
 "script_version":"master",
 "script_parameters":{
  "foo":"bar" 
 }
}' --filters '[
 [
  "repository",
  "=",
  "username/reponame" 
 ],
 [
  "script",
  "=",
  "compare" 
 ],
 [
  "script_version",
  "in git",
  "pipelineB-compare" 
 ]
]'
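Before relying on a tag as the cut-off, it can help to preview locally which commits it would admit. The sketch below assumes that "acceptable" means the tagged commit itself or any of its descendants, which matches the "oldest acceptable version" usage above; the server-side behaviour of the "in git" filter is not verified here. The function `is_acceptable` and the `commit-X` helper tags are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Sketch: list which commits count as "at or after" a cut-off tag.
# ASSUMPTION: acceptable = the tagged commit or any descendant of it.
set -e

# is_acceptable TAG COMMIT: succeeds if TAG is an ancestor of (or equal to) COMMIT.
is_acceptable() {
    git merge-base --is-ancestor "$1" "$2"
}

# Throwaway repository mirroring the A..F history
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.email "dev@example.com"     # placeholder identity for the demo
git config user.name "Demo"
for c in A B C D E F; do
    git commit -q --allow-empty -m "$c"
    git tag "commit-$c"                     # helper tag so each commit has a name
done
git tag pipelineB-compare commit-E

for c in A B C D E F; do
    if is_acceptable pipelineB-compare "commit-$c"; then
        echo "$c: eligible for reuse"
    else
        echo "$c: too old"
    fi
done
```

Under that assumption only E and F are reported as eligible, so jobs that ran the buggy versions A through D would not be re-used.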

In a pipeline template:

"components":{
 "compare":{
  "repository":"username/reponame",
  "script":"compare",
  "script_version":"master",
  "script_parameters":{
   "foo":"bar" 
  },
  "filters":[
   [
    "repository",
    "=",
    "username/reponame" 
   ],
   [
    "script",
    "=",
    "compare" 
   ],
   [
    "script_version",
    "in git",
    "pipelineB-compare" 
   ]
  ]
 }
}
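The job and the template above repeat the same three filters and differ only in the repository, script, and tag names. As a rough sketch, a small shell function can generate that array so the values live in one place. `make_filters` is a hypothetical helper, not part of the arv CLI, and it does no JSON escaping, so its arguments must not contain double quotes or backslashes.

```shell
#!/bin/sh
# Sketch: print the three reuse filters as a JSON array.
# make_filters REPOSITORY SCRIPT TAG  (hypothetical helper; no JSON escaping)
make_filters() {
    printf '[["repository","=","%s"],["script","=","%s"],["script_version","in git","%s"]]\n' \
        "$1" "$2" "$3"
}

make_filters username/reponame compare pipelineB-compare
```

The printed array has the same content as the `--filters` argument of the `arv job create` example, so it could be substituted there with `--filters "$(make_filters username/reponame compare pipelineB-compare)"`.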