h1. Python SDK

(design draft)

h1. Hypothetical future Crunch scripts

We're writing these out with the goal of designing a new SDK for Crunch script authors.

{{toc}}

h2. Example scripts

h3. grep+process example with annotations

<pre><code class="python">
#!/usr/bin/env python

from arvados import CrunchJob

import examplelib
import re

class NormalizeMatchingFiles(CrunchJob):
    @CrunchJob.task()
    def grep_files(self):
        # CrunchJob instantiates input parameters based on the
        # dataclass attribute.  When we ask for the input parameter,
        # CrunchJob sees that it's a Collection, and returns a
        # CollectionReader object.
        input_coll = self.job_param('input')
        for filename in input_coll.filenames():
            self.grep_file(self.job_param('pattern'), input_coll, filename)

    @CrunchJob.task()
    def grep_file(self, pattern, collection, filename):
        regexp = re.compile(pattern)
        with collection.open(filename) as in_file:
            for line in in_file:
                if regexp.search(line):
                    self.normalize(in_file, filename)
                    break

    # examplelib is already multi-threaded and will peg the whole
    # compute node.  These tasks should run sequentially.
    # When tasks are created, Arvados-specific objects like Collection file
    # objects are serialized as task parameters.  CrunchJob instantiates
    # these parameters as real objects when it runs the task.
    @CrunchJob.task(parallel_with=[])
    def normalize(self, coll_file, filename):
        output = examplelib.frob(coll_file.mount_path())
        # self.output is a CollectionWriter.  When this task method finishes,
        # CrunchJob checks if we wrote anything to it.  If so, it takes care
        # of finishing the upload process, and sets this task's output to the
        # Collection UUID.
        with self.output.open(filename) as out_file:
            out_file.write(output)


if __name__ == '__main__':
    NormalizeMatchingFiles(task0='grep_files').main()
</code></pre>
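The comments in the example above assume a fair amount of machinery: a @task()@ decorator that queues method calls as new tasks, and a @job_param()@ accessor that turns script parameters into useful objects. As a very rough sketch of one way that could behave (not a proposed implementation), the toy base class below just runs queued tasks sequentially in a single process; a real @CrunchJob@ would create Arvados task records, serialize the arguments, and re-instantiate Collection objects when each task runs.

<pre><code class="python">
#!/usr/bin/env python
# Illustrative sketch only -- not the proposed SDK.  It runs every queued
# task sequentially in one process, which is just enough to show how the
# task() decorator and job_param() in the example above could behave.

import functools


class CrunchJob(object):
    def __init__(self, task0, params=None):
        self._task0 = task0        # name of the method to queue first
        self._params = params or {}
        self._queue = []           # (function, args, kwargs) tuples

    def job_param(self, name):
        # The real SDK would look this up in the job's script parameters and
        # wrap Collection values in reader objects; here it is just a dict.
        return self._params[name]

    @classmethod
    def task(cls, parallel_with=None):
        # Calling a decorated method queues it instead of running it inline.
        # parallel_with would become a scheduling hint for Crunch; this toy
        # scheduler ignores it because it never runs anything concurrently.
        def decorator(method):
            @functools.wraps(method)
            def enqueue(self, *args, **kwargs):
                self._queue.append((method, args, kwargs))
            return enqueue
        return decorator

    def main(self):
        # Queue task0, then drain the queue in FIFO order.  A real
        # implementation would create one Arvados task per entry and let
        # crunch-dispatch run them, possibly in parallel on other nodes.
        getattr(self, self._task0)()
        while self._queue:
            method, args, kwargs = self._queue.pop(0)
            method(self, *args, **kwargs)
</code></pre>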

h3. Example from #3603

This is the script that Abram used to illustrate #3603.

<pre><code class="python">
#!/usr/bin/env python

from arvados import Collection, CrunchJob
from subprocess import check_call

class Example3603(CrunchJob):
    @CrunchJob.task()
    def parse_human_map(self):
        refpath = self.job_param('REFPATH').name
        for line in self.job_param('HUMAN_COLLECTION_LIST'):
            fastj_id, human_id = line.strip().split(',')
            self.run_ruler(refpath, fastj_id, human_id)

    @CrunchJob.task()
    def run_ruler(self, refpath, fastj_id, human_id):
        check_call(["tileruler", "--crunch", "--noterm", "abv",
                    "-human", human_id,
                    "-fastj-path", Collection(fastj_id).mount_path(),
                    "-lib-path", refpath])
        self.output.add('.')  # Or the path where tileruler writes output.


if __name__ == '__main__':
    Example3603(task0='parse_human_map').run()
</code></pre>
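Both examples hand data to outside tools by calling @mount_path()@, i.e. by exposing a Collection as an ordinary filesystem path. A minimal sketch of what that could mean, assuming the compute node already has a Keep mount available (the @TASK_KEEPMOUNT@ environment variable, the @/keep@ fallback, and this tiny @Collection@ class are assumptions for illustration, not the actual SDK):

<pre><code class="python">
#!/usr/bin/env python
# Sketch only: resolve a collection to a path under an existing Keep mount so
# external programs (e.g. tileruler above) can read it like ordinary files.
# TASK_KEEPMOUNT and the '/keep' fallback are illustrative assumptions.

import os


class Collection(object):
    def __init__(self, locator):
        self.locator = locator    # collection UUID or portable data hash

    def mount_path(self, path=''):
        keep_root = os.environ.get('TASK_KEEPMOUNT', '/keep')
        return os.path.join(keep_root, self.locator, path)


if __name__ == '__main__':
    # Prints something like /keep/<locator>/sample.fastj
    print(Collection('0123456789abcdef0123456789abcdef+0').mount_path('sample.fastj'))
</code></pre>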

h2. Notes/TODO

* Important concurrency limits that job scripts must be able to express:
** Task Z cannot start until all outputs/side effects of tasks W, X, and Y are known/complete (e.g., because Z uses the outputs of W, X, and Y as its inputs).
** Tasks Y and Z cannot run on the same worker node without interfering with each other (e.g., due to RAM requirements).
* In general, the output name is not known until the task is nearly finished. Frequently, though, it is clearer to specify it when the task is queued. We should provide a convenient way to do this without any boilerplate in the queued task.
* A second example that uses a "case" and "control" input (e.g., "tumor" and "normal") might help reveal features.
* We should be clearer about how the output of the job (as opposed to the output of the last task) is to be set. The obvious approach (concatenating all task outputs) should be a one-liner, if not implicit. Either way, it should run in a task rather than being left up to @crunch-job@; one possible shape for such a task is sketched below.
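Continuing the hypothetical API from the examples above, that final job-output step could look something like this. Every name here (@run_after_all_tasks@, @task_outputs()@, @Collection.combine()@, @set_job_output()@) is a placeholder for illustration, not settled design:

<pre><code class="python">
#!/usr/bin/env python
# Sketch only.  run_after_all_tasks, task_outputs(), Collection.combine(),
# and set_job_output() are placeholder names, not proposed API.

from arvados import Collection, CrunchJob


class ExampleWithJobOutput(CrunchJob):
    @CrunchJob.task()
    def start(self):
        # Queue the real work here (as in the examples above), then queue the
        # bookkeeping task that sets the job output.
        self.finish_job()

    # Hypothetical constraint: don't start this task until every other task
    # has finished, so all of their outputs are known.
    @CrunchJob.task(run_after_all_tasks=True)
    def finish_job(self):
        # Concatenate the outputs of all earlier tasks and record the result
        # as the output of the job itself, instead of leaving that to
        # crunch-job.
        combined = Collection.combine(self.task_outputs())
        self.set_job_output(combined)


if __name__ == '__main__':
    ExampleWithJobOutput(task0='start').main()
</code></pre>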