Support #3401: [Documentation] Job Re-Use Kludge - Arvados

Added by Abram Connelly almost 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To:
Category: Documentation
Target version:
Due date:
Story points:

Description

I have a pipeline with two jobs in it. The first job takes about an hour and a half and the second job takes over an hour to run. During development, I noticed that altering scripts and programs used by the second job caused the whole pipeline to be re-run. This surprised me because I had not altered any of the first job's scripts, programs, or inputs.

Although I did not anticipate it, I now understand from discussions that this is expected and desirable behavior. Both of my jobs had their 'script_version' set to 'master', so Arvados had no way of knowing that the first job was actually untouched and, to be safe, it re-ran the first job.
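To make the effect of 'master' concrete, here is a toy model of job re-use. It is not Arvados's actual implementation; it only assumes that a finished job can be re-used when its repository, script, resolved commit, and parameters all match. The helpers `reuse_key` and `make_resolver`, the commit names `commit-a` and `commit-b`, and the input value are all hypothetical.

```python
import hashlib
import json


def reuse_key(component, resolve):
    """Toy identity for a job: repository, script, resolved commit, parameters."""
    identity = {
        "repository": component["repository"],
        "script": component["script"],
        "commit": resolve(component["script_version"]),
        "script_parameters": component["script_parameters"],
    }
    blob = json.dumps(identity, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()


def make_resolver(master_commit):
    """Map 'master' to the commit it currently points at; pass other values through."""
    def resolve(version):
        return master_commit if version == "master" else version
    return resolve


first_job = {
    "repository": "abram",
    "script": "referenceTileSetPipeline/create24merChoppedBedGraph.py",
    "script_version": "master",
    "script_parameters": {"input": "hypothetical-input-collection"},
}
# Same job, but pinned to a fixed (hypothetical) commit instead of 'master'.
pinned_job = dict(first_job, script_version="commit-a")

before = make_resolver("commit-a")  # where 'master' pointed before editing the second job
after = make_resolver("commit-b")   # where 'master' points after a new check-in

# Following 'master', the first job's identity changes with every check-in.
master_reusable = reuse_key(first_job, before) == reuse_key(first_job, after)
# Pinned to a fixed commit, its identity is unaffected by later check-ins.
pinned_reusable = reuse_key(pinned_job, before) == reuse_key(pinned_job, after)

print(master_reusable, pinned_reusable)  # False True
```

In this model, a check-in that only touches the second job's script still moves 'master', which is enough to make the first job look new.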

Since the first job took so long and was already debugged and working, I wanted to re-use its output. To avoid having Arvados re-run the whole pipeline, I created a second pipeline that contained only the second job and hard-coded the output collection from the first job into the second job. This allowed me to test the second leg of the pipeline without waiting for the first job to finish on every development iteration.
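For illustration, the workaround just described can be sketched as a small transformation on a pipeline template: keep one component and turn any `output_of` reference into a plain required Collection input. The helper name `isolate_component` is hypothetical, and the template here is trimmed to the fields the sketch needs.

```python
import copy


def isolate_component(template, name):
    """Return a template containing only `name`, with upstream outputs turned into inputs."""
    component = copy.deepcopy(template["components"][name])
    params = component["script_parameters"]
    for key, value in list(params.items()):
        if isinstance(value, dict) and "output_of" in value:
            # The upstream job is no longer in the pipeline, so ask for its
            # output collection directly instead.
            params[key] = {"required": True, "dataclass": "Collection"}
    return {"name": template["name"], "components": {name: component}}


full_template = {
    "name": "Library Reference Tile Set Pipeline",
    "components": {
        "CreateBandedBedGraphFiles": {
            "script_parameters": {
                "input": {"required": True, "dataclass": "Collection"},
            },
            "script_version": "master",
        },
        "ConstructTileSet": {
            "script_parameters": {
                "input": {"output_of": "CreateBandedBedGraphFiles"},
                "tileLength": 200,
            },
            "script_version": "master",
        },
    },
}

second_only = isolate_component(full_template, "ConstructTileSet")
print(list(second_only["components"]))  # ['ConstructTileSet']
print(second_only["components"]["ConstructTileSet"]["script_parameters"]["input"])
```

The original template is left unmodified, so both variants can be kept side by side during development.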

From discussions, I realize this was a very bad way to develop and that there are features in Arvados that should help with this development cycle. I used the examples from the tutorial as a basis for my pipeline template, and those all have 'master' for their 'script_version'; that is why my jobs all had 'master' as 'script_version'.

What is the recommended workflow for re-using jobs in my pipeline that I know should be re-used? I've heard that one potential workflow is to 'tag' different positions in the git repository history and then use these tags in the 'script_version' pipeline parameter, so that the legs of the pipeline I don't want re-run do not depend on the latest check-in. How do I do this? Is there a tutorial or some documentation I should be looking at?
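As I understand the suggested workflow, it would amount to something like the sketch below: the finished first job gets a fixed 'script_version' while the job still under development keeps following 'master'. This assumes that 'script_version' accepts a git tag or commit hash in place of a branch name, which I have not confirmed. The helper `pin_component` and the tag name `bedgraph-v1` are hypothetical.

```python
import copy


def pin_component(template, component_name, version):
    """Return a copy of the template with one component's script_version fixed."""
    pinned = copy.deepcopy(template)
    pinned["components"][component_name]["script_version"] = version
    return pinned


template = {
    "name": "Library Reference Tile Set Pipeline",
    "components": {
        "CreateBandedBedGraphFiles": {
            "repository": "abram",
            "script_version": "master",
            "script": "referenceTileSetPipeline/create24merChoppedBedGraph.py",
        },
        "ConstructTileSet": {
            "repository": "abram",
            "script_version": "master",
            "script": "referenceTileSetPipeline/buildTileSet.py",
        },
    },
}

# "bedgraph-v1" is a hypothetical tag placed on the last known-good commit
# of the first job's script.
dev_template = pin_component(template, "CreateBandedBedGraphFiles", "bedgraph-v1")

for name, component in dev_template["components"].items():
    print(name, component["script_version"])
```

Pinning every component the same way might be one route from development to production, since no leg would then depend on the latest check-in.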

Also, how do I transition my pipeline from the development phase to a production phase?

For completeness, here is the pipeline I have:

{
  "name": "Library Reference Tile Set Pipeline",
  "components": {
    "CreateBandedBedGraphFiles": {
      "script_parameters": {
        "input": {
          "required": true,
          "dataclass": "Collection"
        },
        "bigWigToBedGraph": "BIGWIGLOCATER",
        "cytoBand": "UCSC.CYTOBAND",
        "createBandBedGraph": "CREATEBANDBEDGRAPH"
      },
      "repository": "$USER",
      "script_version": "master",
      "script": "referenceTileSetPipeline/create24merChoppedBedGraph.py"
    },
    "ConstructTileSet": {
      "script_parameters": {
        "input": {
          "output_of": "CreateBandedBedGraphFiles"
        },
        "cytoBand": "CYTOBAND",
        "buildTileSet": "BUILDTILESET",
        "hg19.fa": "HG19FA",
        "tileLength": 200
      },
      "script_version": "master",
      "repository": "$USER",
      "script": "referenceTileSetPipeline/buildTileSet.py"
    }
  }
}

And here is my kludge to run only the second job of the pipeline:

{
  "name": "Library Reference Tile Set Pipeline",
  "components": {
    "ConstructTileSet": {
      "script_parameters": {
        "input": {
          "required": "true",
          "dataclass": "Collection"
        },
        "cytoBand": "UCSC.CYTOBAND",
        "buildTileSet": "BUILDTILESET",
        "echofile": "ECHOFILE",
        "hg19.fa": "HG19FA",
        "tileLength": 200
      },
      "script_version": "master",
      "repository": "$USER",
      "script": "referenceTileSetPipeline/buildTileSet.py"
    }
  }
}

In both, I have a shell script that substitutes real values for the fields that are in all caps and replaces the '$USER' string with my own username ('abram').
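My actual substitution is done by a shell script that is not reproduced here. The sketch below shows the same idea in Python: it walks the parsed template and swaps any string value that exactly matches a placeholder. The function name `fill_placeholders` and the replacement values are hypothetical stand-ins, not my real locators.

```python
import json


def fill_placeholders(node, values, user):
    """Recursively replace placeholder strings and '$USER' in a parsed template."""
    if isinstance(node, dict):
        return {key: fill_placeholders(val, values, user) for key, val in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(item, values, user) for item in node]
    if isinstance(node, str):
        if node == "$USER":
            return user
        # Only whole-string matches are replaced, so "CYTOBAND" and
        # "UCSC.CYTOBAND" are treated as distinct placeholders.
        return values.get(node, node)
    return node  # numbers, booleans, and null pass through unchanged


template_text = """
{
  "components": {
    "ConstructTileSet": {
      "script_parameters": {
        "cytoBand": "UCSC.CYTOBAND",
        "buildTileSet": "BUILDTILESET",
        "tileLength": 200
      },
      "script_version": "master",
      "repository": "$USER",
      "script": "referenceTileSetPipeline/buildTileSet.py"
    }
  }
}
"""

# Hypothetical replacement values, for illustration only.
values = {
    "UCSC.CYTOBAND": "hypothetical-cytoband-locator",
    "BUILDTILESET": "hypothetical-buildtileset-locator",
}

filled = fill_placeholders(json.loads(template_text), values, user="abram")
print(json.dumps(filled, indent=2))
```

Working on the parsed JSON rather than on raw text avoids accidentally rewriting part of a longer string that happens to contain a placeholder name.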
