Idea #3603 (closed)

[Crunch] Design good Crunch task API, including considerations about "jobs-within-jobs" and "reusable tasks" ideas

Added by Abram Connelly over 9 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Story points: 1.0

Description

I find there is a lot of boilerplate code that needs to be written in order to get even a basic crunch job running. As of this writing, there is a 'run-command' script which makes this process a lot easier, but even 'run-command' is becoming unwieldy and doesn't always allow crunch jobs to be written easily.

As a concrete example, I have a pipeline that expects a text file in a collection where each line contains a resource locator string and a "human ID", comma separated. The first few lines of the file look as follows:

01ebb0ee6072d5d1274e5f805c520d38+51822,huEC6EEC
01f2b380198d9f2f8592e3eca2731b00+52431,huC434ED
039a116e865a63956dded36894dc7f20+52432,hu0D879F
04ba952fb67485b6c207db50cf9231eb+52433,huF1DC30
0527805fd792af51b89f7a693fb86f1a+52431,hu032C04
...

Each of the locator strings represents a collection with many files that the program in the pipeline will process.

The 'run-command' script has a 'task.foreach' capability which can create a new task for each input line. This almost does what I want, but since I have two fields to pass into my program, I need a small processing step to parse each line before handing the values to the program I want to run.

Using 'run-command', I have written a 'shim' script that takes in parameters on the command line and then executes the program.

Here is the relevant portion of the template with the 'shim' script put in:

            ...
            "script": "run-command",
            "script_parameters": {
                "command": [
                    "$(job.srcdir)/crunch_scripts/shim",
                    "$(fj_human)",
                    "$(job.srcdir)/crunch_scripts/bin/tileruler",
                    "$(file $(REFPATH))",
                    "$(job.srcdir)/crunch_scripts/bin/lz4" 
                ],

                "fj_human" : "$(file $(HUMAN_COLLECTION_LIST))",
                "task.foreach": "fj_human",

                "HUMAN_COLLECTION_LIST": {
                    "required": true,
                    "dataclass": "File" 
                },

                "REFPATH" : {
                    "required": true,
                    "dataclass": "File" 
                }
            },
            ...

And the shim script for completeness:

#!/bin/bash

# Arguments passed in by run-command (see the template above).
fj_human=$1
tileruler=$2
refpath=$3
lz4=$4

# Split the "locator,humanID" input line into its two fields.
fj_uuid=`echo $fj_human | cut -f1 -d','`
huname=`echo $fj_human | cut -f2 -d','`

$tileruler --crunch --noterm abv -human $huname -fastj-path $TASK_KEEPMOUNT/$fj_uuid -lib-path $refpath

One could imagine extending 'run-command' with more options to facilitate this type of workflow, but I think the deeper issue is providing a simpler SDK for common environments.

For example, here is what I would imagine a template looking like:

            ...
            "script": "myjob",
            "script_parameters": {
                "HUMAN_COLLECTION_LIST": {
                    "required": true,
                    "dataclass": "File" 
                },
                "REFPATH" : {
                    "required": true,
                    "dataclass": "File" 
                }
            },
            ...

And a hypothetical bash 'myjob' script:

#!/usr/bin/arvenv

for x in `cat $HUMAN_COLLECTION_LIST`
do
  fj_uuid=`echo $x | cut -f1 -d,`
  huname=`echo $x | cut -f2 -d,`

  arvqueuetask arvenv tileruler --crunch --noterm abv -human $huname -fastj-path $TASK_KEEPMOUNT/$fj_uuid -lib-path $REFPATH
done

Here is a hypothetical "myjob" Python script to do the same:

#!/usr/bin/python

import arvados as arv

job = arv.this_job()
input_collection = job["script_parameters"]["HUMAN_COLLECTION_LIST"]
refpath = job["script_parameters"]["REFPATH"]

with open(input_collection) as f:
  for line in f:
    fj_uuid, huname = line.strip().split(',')
    arv.queuetask(["arvenv", "tileruler", "--crunch", "--noterm", "abv",
                   "-human", huname,
                   "-fastj-path", arv.keepmount() + "/" + fj_uuid,
                   "-lib-path", refpath])

Where 'arvenv' could be an enhanced version of 'run-command' or something else that's smart about setting up the environment.
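
For what it's worth, a helper like 'queuetask' would not need much machinery underneath. Here is a minimal sketch, assuming it simply wraps the existing job_tasks API (the helper name and the 'command' parameter key are made up for illustration):

import arvados

def queuetask(command, sequence=1):
    # Queue one new task in the current job; a later invocation of the
    # script would find `command` in its task parameters and execute it.
    return arvados.api().job_tasks().create(body={
        'job_uuid': arvados.current_job()['uuid'],
        'created_by_job_task_uuid': arvados.current_task()['uuid'],
        'sequence': sequence,
        'parameters': {'command': command},
    }).execute()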

Both of the hypothetical scripts might seem a bit short, but I believe they are much more in line with what people (myself included) expect these kinds of 'adaptor' scripts to look like.

Making these scripts with the Python SDK would require at least two pages of boilerplate code. The 'run-command' script helps reduce boilerplate but, in my opinion, at the cost of versatility and readability. Both of the above scripts really only need access to the variables specified in the template, plus easily accessible helper functions for Arvados functionality (in this case, the ability to create tasks easily).
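
Concretely, much of that boilerplate is the task-sequencing skeleton that every hand-written crunch script repeats: a first task (sequence 0) that parses the input and queues the real work, and a worker branch that runs for each queued task. A rough sketch, assuming the current SDK's job/task accessors (illustrative, not exact):

import arvados

job = arvados.current_job()
task = arvados.current_task()

if task['sequence'] == 0:
    # Dispatch pass: parse the input list and queue one worker task per
    # line (e.g. via a helper like the queuetask sketch above), then mark
    # this dispatch task as finished.
    ...
    arvados.api().job_tasks().update(uuid=task['uuid'],
                                     body={'success': True}).execute()
else:
    # Worker pass: do the real work described by this task's parameters.
    params = task['parameters']
    ...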

All of the above assumes the conventions that 'run-command' establishes, such as making the current working directory the 'output' directory and automatically putting any files created there into a collection at job/task end.
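
That convention saves a fair amount of code on its own: without it, each task would have to gather its own outputs and record them explicitly. A rough sketch of the manual version, assuming the CollectionWriter and job_tasks pieces of the current Python SDK (details from memory, illustrative only):

import arvados

task = arvados.current_task()

# Pack everything the tool wrote into the current working directory into a
# new collection, then record that collection as this task's output.
writer = arvados.CollectionWriter()
writer.write_directory_tree('.')
output_locator = writer.finish()

arvados.api().job_tasks().update(
    uuid=task['uuid'],
    body={'output': output_locator, 'success': True},
).execute()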


Subtasks (2: 0 open, 2 closed)

Task #5031: Review/feedback - Closed - Tom Clegg - 08/27/2014
Task #3718: Hash out desired API with science team - Closed - 08/27/2014

Related issues

Related to Arvados - Feature #4528: [Crunch] Dynamic task allocation based on job size determined at runtime - Closed - 11/14/2014
Related to Arvados - Feature #4561: [SDKs] Refactor run-command so it can be used as an SDK by scripts in a git tree - Closed
Precedes Arvados - Idea #3347: [Crunch] Run dev (and real) jobs using a syntax as close as possible to "python foo.py input.txt" - Closed - 08/28/2014 - 08/28/2014