h1. Python SDK

This is a design draft for a next-generation Python SDK, making it easier to write Crunch scripts today, with room to grow for future workflow changes in Arvados.

{{toc}}

h2. Goals

As I understand them, in order of importance:

* The SDK should provide a high-level, Pythonic interface to dispatch work through Arvados.  It should eliminate any need to manipulate jobs or tasks directly via API calls, or manually instantiate objects from parameter strings.  There should be a bare minimum of boilerplate code in a user's Crunch script—just the tiny bit necessary to use the SDK.
* The SDK should be usable with today's Crunch, so users can start working with it and give us feedback to inform future development.
* The SDK should have clear room to grow to accommodate future workflow improvements, whether that's new Crunch features (including possibly reusable tasks and/or elimination of the job/task distinction), hooks into the Common Workflow Language, or whatever.
  The first version of the SDK will not support features we expect these systems to have—that would conflict with the previous goal.  The idea here is that the API should not rely on details of today's Crunch implementation, and should have clear spaces where users will be able to use the features of more powerful work dispatch systems in the future.

h2. Components

h3. Dispatcher

Dispatcher classes provide a consistent interface to talk to different scheduling systems: they know how to run the right code for this instantiation (e.g., the current task); how to dispatch new work to the scheduling system; how to serialize and deserialize parameters; etc.

Early on, we'll have a traditional Crunch dispatcher and a "running locally for debugging" dispatcher.  If we want to support entirely different backends in the future, the Dispatcher interface should provide the right level of abstraction to do that.  While I'm not sketching out an API here (since it's not user-facing), we do need to be careful to make sure that Dispatcher methods hit that design target.

Users can instantiate Dispatcher classes to customize their behavior.  For example, one thing we know we want to support early on is telling Crunch how to set the job output from given tasks.  Since the job/task distinction is specific to Crunch v1, that customization belongs in the Crunch Dispatcher class.  We'll let users pass in task names to specify which outputs should become the job output:

@dispatcher = CrunchDispatcher(job_output_from=['taskB', 'taskD'])@

However, all Dispatchers should be able to run without any customization, and users shouldn't have to deal with them at all in the easy cases where the scheduler's default behavior is fine.
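
For illustration, here's how that split might look in a script (@MyScript@ is a placeholder name, not part of the SDK):

<pre><code class="python"># Easy case: let run() pick and configure a Dispatcher on its own.
MyScript.run()

# Customized case: build a Crunch-specific Dispatcher and hand it to run().
dispatcher = CrunchDispatcher(job_output_from=['taskB', 'taskD'])
MyScript.run(dispatcher=dispatcher)</code></pre>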

h3. CrunchScript

The bulk of a user's Crunch script will define a subclass of the SDK's CrunchScript class.  The user extends it with WorkMethods that represent units of work, and a start() method that implements the code to start the work.  That done, the user calls the subclass's run() class method to get the appropriate code running.
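
As a minimal sketch of that shape (the class and method names here are illustrative, not part of the SDK):

<pre><code class="python">from arvados.crunch import CrunchScript, WorkMethod

class MyScript(CrunchScript):
    def start(self, input):
        # Kick off one unit of work per file in the input Collection.
        for in_file in input.all_files():
            self.process(in_file)

    @WorkMethod()
    def process(self, in_file):
        # The real compute work for one file would go here.
        pass

if __name__ == '__main__':
    MyScript.run()</code></pre>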

The run() class method accepts the Dispatcher to use in the @dispatcher@ keyword argument.  If none is provided, it introspects the environment to find the appropriate Dispatcher class to use.  It instantiates that Dispatcher class, passing along an instance of itself.  To better illustrate, I'll describe how the current Crunch dispatcher would take over from here.

The Crunch Dispatcher gets the current task.  If that's task 0, it gets the current job, deserializes the job parameters, and calls the subclass's start() method with those arguments.  For any other task, it looks up the task's corresponding WorkMethod and runs its code with deserialized parameters.

Either way, the method's return value is captured.  If it's a Collection object, the Dispatcher makes sure it's saved, then records the portable data hash as the task's output.  For any other return value except None, it serializes the return value to JSON, saves that in a Collection, and uses that as the task's output.
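
In rough pseudocode, the Crunch Dispatcher's side of this might look like the following (all method names here are invented for illustration; none of this is a committed API):

<pre><code class="python">def run_current_task(self, script):
    task = self.current_task()
    if task.sequence == 0:
        # Task 0: deserialize the job parameters and call the user's start().
        result = script.start(**self.deserialize(self.current_job().parameters))
    else:
        # Any later task: find the WorkMethod it was queued for and run its
        # real code with deserialized parameters.
        method = script.lookup_work_method(task.method_name)
        result = method.run_user_code(script, *self.deserialize(task.parameters))
    # Record the task output under the rules above.
    if isinstance(result, Collection):
        result.save()
        task.set_output(result.portable_data_hash())
    elif result is not None:
        task.set_output(self.save_json_in_collection(result))</code></pre>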

h3. WorkMethod

WorkMethod is a class used to decorate instance methods defined on the CrunchScript subclass.  It demarcates work that can be scheduled through a Dispatcher.  Usage looks like:

<pre><code class="python">@WorkMethod()
def task_code(self, param1, param2):
    # implementation
</code></pre>

When a WorkMethod is called—from start() or another WorkMethod's code—the user's code isn't run.  Instead, we give the Dispatcher all the information about the call.  The Dispatcher serializes the parameters and schedules the real work method to be run later.  We wrap information about the work's scheduling in a FutureOutput object, which I'll get to next.
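
A rough sketch of how that interception could work internally (again, the attribute and method names are invented for illustration):

<pre><code class="python">class WorkMethod(object):
    def __call__(self, user_method):
        # Used as a decorator: wrap the user's method so that calling it
        # hands the call off to the Dispatcher instead of running it.
        def schedule(script, *args, **kwargs):
            # The Dispatcher serializes the arguments, queues the real work
            # to run later, and returns a FutureOutput placeholder.
            return script.dispatcher.schedule_work(user_method, args, kwargs)
        schedule.run_user_code = user_method
        return schedule</code></pre>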

Because the Dispatcher has to be able to serialize arguments and output, those values can only have types that the SDK knows how to work with, and some type information may be lost.  For the first SDK implementation, the supported types will be anything that can be encoded to JSON, plus Collection objects.  Sequence types that aren't native to JSON, like tuples and iterators, will become lists at the method's boundaries.
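
For example, under these rules a WorkMethod called with a tuple or a generator would see a plain list when its code actually runs (@summarize@ is a hypothetical WorkMethod):

<pre><code class="python"># What the caller writes:
self.summarize((1, 2, 3))                  # tuple argument
self.summarize(x ** 2 for x in range(4))   # generator argument

# What summarize() actually receives when its code runs:
#   [1, 2, 3]
#   [0, 1, 4, 9]</code></pre>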

Note that WorkMethod accepts arguments, and we're passing it none.  We expect the constructor to grow optional keyword arguments over time that let the user specify scheduling constraints.

h3. FutureOutput

FutureOutput objects represent the output of work that has been scheduled but not yet run.  Whenever a WorkMethod is called, it returns a FutureOutput object.  The only thing the user can do with this object is pass it in as an argument to other WorkMethod calls.  When that happens, the Dispatcher does two things.  First, it introspects all the FutureOutput objects passed in to figure out how the new work should be scheduled.  Second, when it's time to run the work of the second WorkMethod, it finds the real output generated by running the first WorkMethod, and passes that in when running the user code.  To be clear: each Dispatcher will have its own paired FutureOutput implementation.

FutureOutput objects cannot be resolved or manipulated by the user directly.  Users can't wait for the real output to become available (because in some schedulers, like today's Crunch, it never will be).  The objects should have no user-facing methods that might suggest this is possible.  We'll probably even redefine some very basic methods (e.g., @__str__@) to be _less_ useful than @object@'s default implementation, to really drive this point home.
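
Purely as an illustration, the shape of such a class might be:

<pre><code class="python">class FutureOutput(object):
    def __init__(self, dispatcher, **scheduling_info):
        # Only the Dispatcher that created this object knows how to
        # interpret the scheduling information it wraps.
        self.dispatcher = dispatcher
        self.scheduling_info = scheduling_info

    def __str__(self):
        # Deliberately unhelpful: the real output doesn't exist yet, and in
        # some schedulers it will never be visible from this process.
        raise TypeError("FutureOutput has no value; pass it to a WorkMethod instead")

    __repr__ = __str__</code></pre>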

Let me talk through a Crunch example to make things concrete.  Roughly, the FutureOutput object will wrap the task we create to represent the work.  Imagine a simple linear Crunch job:

<pre><code class="python">step1out = task1(arg1, arg2)
task2(step1out, arg)</code></pre>

When task2 is called, the Crunch Dispatcher introspects the arguments and finds the step1out FutureOutput object among them.  The task2 code can't be run until task1 has finished.  The Crunch Dispatcher gets task1's sequence number from step1out, and creates a task to run task2 with a greater sequence number.

When Crunch actually starts the compute work for task2, the Crunch Dispatcher will see that this argument should have the value of task1's output.  It will deserialize that from the task output following the rules described earlier, then pass the resulting object in as the value when it calls the user's code.
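
Putting those two steps into rough pseudocode (the helper names are invented for illustration):

<pre><code class="python"># Scheduling time: queue the new task after every task it depends on.
def schedule_work(self, method, args, kwargs):
    futures = find_future_outputs(args, kwargs)
    sequences = [f.scheduling_info['task']['sequence'] for f in futures]
    next_sequence = (max(sequences) if sequences else self.current_task['sequence']) + 1
    new_task = self.create_task(method, self.serialize(args, kwargs), sequence=next_sequence)
    return FutureOutput(self, task=new_task)

# Run time: swap each FutureOutput placeholder for the finished task's real
# output, deserialized under the rules described earlier.
def resolve_argument(self, arg):
    if isinstance(arg, FutureOutput):
        return self.deserialize_output(arg.scheduling_info['task'])
    return arg</code></pre>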

h3. Review

Here are the interfaces that script authors will use:

* They will subclass CrunchScript to define their own start() method and WorkMethods.
* They will decorate their work methods with WorkMethod().  The decorator currently takes no arguments, but you still call it with parentheses so it can grow optional keyword arguments in the future.
* They will receive FutureOutput objects when they call WorkMethods, and pass those to other WorkMethod calls.  The user doesn't need to know any of the semantics of FutureOutput at all; only the Dispatcher does.
* The main body of the script will call the run() class method on their CrunchScript subclass.

This is a very small interface.  There's nothing Crunch-specific in it.  When the library takes responsibility for some Arvados interface, it takes responsibility for all interactions with it: e.g., it both serializes and deserializes parameters; it controls all aspects of work scheduling.  Users shouldn't have to encode any Crunch specifics in their scripts.  I think that bodes well for future development of the SDK.

h2. Hypothetical future Crunch scripts

These are completely untested, since they can't actually run.  Please forgive cosmetic mistakes.

h3. Example: Crunch statistics on a Collection of fastj files indicating a tumor

This demonstrates scheduling tasks in order and explicitly setting the job output.

<pre><code class="python">
#!/usr/bin/env python

import subprocess
import tempfile

from arvados.collection import CollectionWriter
from arvados.crunch import CrunchScript, WorkMethod, CrunchDispatcher

class TumorAnalysis(CrunchScript):
    # input is a Collection named in a job parameter.
    # The SDK instantiates the corresponding Collection object and passes it in.
    def start(self, input):
        # analyze_tumors is being called with a generator of
        # FutureOutput objects.  The Dispatcher will unroll this (it
        # inevitably needs to traverse recursive data structures),
        # schedule all the classify tasks at sequence N+1, then
        # schedule analyze_tumors after them at sequence N+2.
        self.analyze_tumors(self.classify(in_file)
                            for in_file in input.all_files()
                            if in_file.name.endswith('.fastj'))

    @WorkMethod()
    def classify(self, in_file):
        out_file = tempfile.NamedTemporaryFile()
        proc = subprocess.Popen(['normal_or_tumor'],
                                stdin=subprocess.PIPE, stdout=out_file)
        for data in in_file.readall():
            proc.stdin.write(data)
        proc.stdin.close()
        proc.wait()
        out_file.seek(0)
        if 'tumor' in out_file.read(4096):
            out_file.seek(0)
            out_coll = CollectionWriter()
            with out_coll.open(in_file.name) as output:
                output.write(out_file.read())
            return out_coll
        return None

    @WorkMethod()
    def analyze_tumors(self, results):
        compiled = {}
        for result in results:
            if result is None:
                continue
            # Imagine a reduce-type step operating on the file in the
            # Collection.
            compiled[thing] = ...
        return compiled


if __name__ == '__main__':
    dispatcher = CrunchDispatcher(job_output_from=['analyze_tumors'])
    TumorAnalysis.run(dispatcher=dispatcher)
</code></pre>

h3. Example from #3603

This is the script that Abram used to illustrate #3603.

<pre><code class="python">
#!/usr/bin/env python

from subprocess import check_call

from arvados.collection import CollectionReader, CollectionWriter
from arvados.crunch import CrunchScript, WorkMethod, CrunchDispatcher

class Example3603(CrunchScript):
    def start(self, human_collection_list, refpath):
        for line in human_collection_list:
            fastj_id, human_id = line.strip().split(',')
            self.run_ruler(refpath, fastj_id, human_id)

    @WorkMethod()
    def run_ruler(self, refpath, fastj_id, human_id):
        # CollectionReader and StreamFileReader objects should grow a
        # mount_path() method that gives the object's location under
        # $TASK_KEEPMOUNT.
        check_call(["tileruler", "--crunch", "--noterm", "abv",
                    "-human", human_id,
                    "-fastj-path", CollectionReader(fastj_id).mount_path(),
                    "-lib-path", refpath])
        out_coll = CollectionWriter()
        # The argument is the path where tileruler writes its output.
        out_coll.write_directory_tree('.')
        return out_coll


if __name__ == '__main__':
    # There will be many run_ruler tasks.  With this setting, their outputs
    # will be concatenated to form the job output (which is also the
    # default behavior).
    dispatcher = CrunchDispatcher(job_output_from=['run_ruler'])
    Example3603.run(dispatcher=dispatcher)
</code></pre>

h2. Notes/TODO

(Some of these may be out of date, but I'll leave that for the suggesters to decide.)

* Important concurrency limits that job scripts must be able to express:
** Task Z cannot start until all outputs/side effects of tasks W, X, Y are known/complete (e.g., because Z uses WXY's outputs as its inputs).
*** I'm considering having task methods immediately return some kind of future object.  You could pass that in to other task methods, and the SDK would introspect it and schedule the second task to come after the first.  This would also make it look more like normal Python: a higher-level function can call one task method, grab its output, and pass it in to another.  Probably the only catch is that we wouldn't be using the original method's real return value, but instead the task's output collection.  When we run the second task, what was represented by a future can be replaced with a CollectionReader object for the task's output.
** Tasks Y and Z cannot run on the same worker node without interfering with each other (e.g., due to RAM requirements).
** Schedule at most N instances of task T on the same node.  (The @parallel_with=[]@ example above is stronger than we actually want for a multithreaded task.  It would be okay to run several of these tasks in parallel, as long as there's only one per node.  This could be expressed as a particular case of the above bullet: "Task T cannot run on the same worker node as task T without interference."  But that feels less natural.)
*** This is an SDK for Crunch, and currently Crunch gives the user no control over task scheduling beyond the sequence, such as whether tasks can or can't share a compute node.  I don't know how to write these features given that there's no way to express these constraints in a way Crunch will recognize and respect (beyond the brute-force "don't parallelize this").
* In general, the output name is not known until the task is nearly finished.  Frequently it is clearer to specify it when the task is queued, though.  We should provide a convenient way to do this without any boilerplate in the queued task.
** This seems easily doable by passing the desired filename into the task method as a parameter.  Then the code can work with @self.output.open(passed_filename)@.
* A second example that uses a "case" and "control" input (e.g., "tumor" and "normal") might help reveal features.
* We should get clearer about how the output of the job (as opposed to the output of the last task) is to be set.  The obvious way (concatenate all task outputs) should be a one-liner, if not implicit.  Either way, it should run in a task rather than being left up to @crunch-job@.
** This also sounds like something that needs to be settled at the Crunch level.  But it seems like we could be pretty accommodating by having crunch-job just say, "If the job set an output, great; if not, collate the task outputs."  Then the SDK can provide pre-defined task methods for common output setups: collate the output of all tasks or a subset of them; use the output of a specific task; etc.