Support #3401

closed

[Documentation] Job Re-Use Kludge

Added by Abram Connelly over 9 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Documentation
Target version: -
Due date:
Story points: -

Description

I have a pipeline that has two jobs in it. The first job takes about an hour and a half and the second job takes over an hour to run. During development, I noticed that altering scripts and programs from the second job would cause the whole pipeline to be re-run. This surprised me because I had not altered any scripts or programs from the first job, nor any of its inputs.

Though initially unanticipated, from discussions I realize that this is expected and desirable behavior. Since both of my jobs had their 'script_version' set to 'master', there is no way for Arvados to know that the first job was actually untouched and so, to be safe, it re-ran the first job.

Since the first job took so long and the first job was debugged and working, I wanted to re-use the output from the first job. In order to get around having Arvados re-run the whole pipeline, I created a second pipeline that only used the second job and hard coded the output collection from the first job into the second job. This allowed me to test the second leg of the pipeline without waiting for the first job to finish every development iteration.

From discussions, I realize this was a very bad way to develop and there are features in Arvados that should help with this development cycle. I used the examples from the tutorial as a basis for my pipeline template and those all have 'master' for their script_version and that's why my jobs all had 'master' as 'script_version'.

What is the recommended workflow to re-use jobs in my pipeline that I know should be re-used? I've heard that a potential workflow is to 'tag' different positions in the git repository history and then use these tags in the 'script_version' pipeline parameter, so that the legs of the pipeline I don't want re-run do not depend on the latest check-in. How do I do this? Is there a tutorial or some documentation I should be looking at?
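To make the tagging idea concrete, here is a minimal sketch of what I imagine the template change would look like. It assumes that 'script_version' accepts a git tag in place of 'master', which is only what I have heard and not something I have confirmed. The tag name "bedgraph-stable" and the helper pin_script_version are hypothetical, and the template is trimmed to the fields that matter here.

```python
import copy
import json


def pin_script_version(template, component, version):
    """Return a copy of the template with one component's script_version replaced."""
    pinned = copy.deepcopy(template)
    pinned["components"][component]["script_version"] = version
    return pinned


# Trimmed-down stand-in for the two-job template in this ticket.
template = {
    "name": "Library Reference Tile Set Pipeline",
    "components": {
        "CreateBandedBedGraphFiles": {"script_version": "master"},
        "ConstructTileSet": {"script_version": "master"},
    },
}

# "bedgraph-stable" is a hypothetical git tag marking the debugged first job.
pinned = pin_script_version(template, "CreateBandedBedGraphFiles", "bedgraph-stable")
print(json.dumps(pinned, indent=2))
```

The second job stays on 'master' so it keeps following my latest check-ins, while the first job would refer to a fixed point in the history.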

Also how do I transition my pipeline from the development phase to a production phase?

For completeness, here is the pipeline I have:

{

  "name": "Library Reference Tile Set Pipeline",

  "components": {

    "CreateBandedBedGraphFiles": {
      "script_parameters": {
        "input": {
          "required": true,
          "dataclass" : "Collection" 
        },
        "bigWigToBedGraph": "BIGWIGLOCATER",
        "cytoBand" : "UCSC.CYTOBAND",
        "createBandBedGraph" : "CREATEBANDBEDGRAPH" 
      },
      "repository": "$USER",
      "script_version": "master",
      "script": "referenceTileSetPipeline/create24merChoppedBedGraph.py" 
    },

    "ConstructTileSet" : {
      "script_parameters": {
        "input": {
          "output_of": "CreateBandedBedGraphFiles" 
        },
        "cytoBand" : "CYTOBAND",
        "buildTileSet" : "BUILDTILESET",
        "hg19.fa" : "HG19FA",
        "tileLength" : 200

      },
      "script_version": "master",
      "repository": "$USER",
      "script": "referenceTileSetPipeline/buildTileSet.py" 
    }

  }

}

and my kludge to only run the second job of the pipeline:

{

  "name": "Library Reference Tile Set Pipeline",

  "components": {

    "ConstructTileSet" : {
      "script_parameters": {
        "input": {
          "required" : true,
          "dataclass": "Collection" 
        },
        "cytoBand" : "UCSC.CYTOBAND",
        "buildTileSet" : "BUILDTILESET",
        "echofile" : "ECHOFILE",
        "hg19.fa" : "HG19FA",
        "tileLength" : 200

      },
      "script_version": "master",
      "repository": "$USER",
      "script": "referenceTileSetPipeline/buildTileSet.py" 
    }

  }

}

In both, I have a shell script that does a substitution for the field values that are all in caps and replaces the '$USER' string with my own username ('abram').
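For readers without that shell script, the substitution step can be sketched roughly as follows. This is a Python stand-in written for illustration, not the script I actually use. The function name fill_template and the replacement value "example-cytoband-locator" are hypothetical. It treats any quoted all-caps string as a placeholder and leaves placeholders without a supplied value untouched.

```python
import re


def fill_template(text, values, username):
    """Replace "$USER" and quoted ALL-CAPS placeholders in template text."""
    text = text.replace("$USER", username)

    def lookup(match):
        key = match.group(1)
        # Leave placeholders without a supplied value as they are.
        if key in values:
            return '"%s"' % values[key]
        return match.group(0)

    # Matches quoted values such as "UCSC.CYTOBAND" or "HG19FA".
    return re.sub(r'"([A-Z][A-Z0-9.]*)"', lookup, text)


snippet = '{"cytoBand" : "UCSC.CYTOBAND", "repository": "$USER", "hg19.fa" : "HG19FA"}'
filled = fill_template(snippet, {"UCSC.CYTOBAND": "example-cytoband-locator"}, "abram")
print(filled)
```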


Related issues

Related to Arvados - Idea #3407: [Documentation] Pipeline development workflow (Closed, 07/29/2014)
Related to Arvados - Idea #3511: [Documentation] Present an efficient pattern for developing a pipeline template with multiple crunch scripts (Closed)
#1

Updated by Tom Clegg over 9 years ago

  • Target version set to 2014-08-27 Sprint

#2

Updated by Tom Clegg over 9 years ago

  • Subject changed from Job Re-Use Kludge to [Documentation] Job Re-Use Kludge

#3

Updated by Ward Vandewege over 9 years ago

  • Story points set to 0.5

#4

Updated by Tim Pierce over 9 years ago

  • Category set to Documentation

#5

Updated by Tom Clegg over 9 years ago

  • Story points deleted (0.5)

#6

Updated by Peter Amstutz over 9 years ago

  • Tracker changed from Idea to Task

#7

Updated by Peter Amstutz over 9 years ago

  • Tracker changed from Task to Support

#8

Updated by Peter Amstutz over 9 years ago

  • Target version deleted (2014-08-27 Sprint)

#9

Updated by Peter Amstutz almost 6 years ago

  • Status changed from New to Closed

(obsolete)
