h1. Distributed workflows

h2. Problem description

A user wants to run a meta-analysis on data located on several different clusters. For either efficiency or legal reasons, the data should be analyzed in place, with the results aggregated and returned to a central location. The user should be able to express the multi-cluster computation as a single CWL workflow, and no manual intervention should be required while the workflow is running.

h2. Simplifying assumptions

The user explicitly indicates in the workflow which cluster each computation (data + code) runs on.

Data transfer only occurs between the primary cluster and the secondary clusters, not between secondary clusters.

h2. Proposed solution

h3. Run subworkflow on cluster

A workflow step can be given a CWL hint, "RunOnCluster", indicating that the tool or subworkflow run by that step should run on a specific Arvados cluster instead of being submitted to the cluster the workflow runner is currently running on. The implementation would be similar to the "RunInSingleContainer" feature: construct a container request that runs the workflow runner on the remote cluster, then wait for its results.
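
As a rough sketch, the step-level hint might look like this in the workflow document. The hint name comes from this proposal; the @clusterID@ field and the step/file names are illustrative assumptions, not a finalized schema:

<pre><code class="yaml">
steps:
  analyze_remote:
    run: subworkflow.cwl        # subworkflow to execute on the remote cluster
    hints:
      - class: RunOnCluster     # proposed hint, interpreted by arvados-cwl-runner
        clusterID: xyzzy        # assumed field: identifier of the target cluster
    in:
      input_data: sharded_input
    out: [results]
</code></pre>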
18 | |||
19 | 3 | Peter Amstutz | h3. Data transfer |
20 | |||

For the workflow to run successfully on the remote cluster, it needs its data dependencies (Docker images, scripts, reference data, etc.). There are several options:

# Don't transfer dependencies at all. Workflows fail if dependencies are not available; the user must transfer collections manually using arv-copy.
** pros: least work
** cons: terrible user experience; workflow patterns that transfer data out to remote clusters don't work
# Distribute dependencies as part of workflow registration (requires proactively distributing dependencies to every cluster that might ever need them).
** pros: less burden on the user compared to option (1)
** cons: doesn't guarantee the dependencies are available where needed; the --create-workflow/--update-workflow options of arvados-cwl-runner have to orchestrate uploading data to every cluster in the federation; workflow patterns that transfer data out to remote clusters don't work
# The workflow runner determines which dependencies are missing from the remote cluster and pushes them before scheduling the subworkflow (see the sketch after this list).
** pros: no user intervention required; data is copied only to clusters that are expected to need it
** cons: copies all dependencies regardless of whether they are actually used; requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# The workflow runner on the remote cluster determines which dependencies are missing and pulls them from federated peers on demand.
** pros: no user intervention required; only the data actually needed is copied
** cons: requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# Federated access to collections: fetch data blocks on demand from another cluster.
** pros: only the data blocks actually needed are fetched; no copy of the collection record in the remote database
** cons: requires SDK improvements to handle multiple clusters; requires a caching proxy to avoid re-fetching the same block (for example, when 100 nodes all try to run a Docker image from a federated collection)
# Hybrid federation: copy a collection to the remote cluster, but retain the UUID and permissions from the source.
** pros: no user intervention; only the blocks actually needed are fetched, from local Keep if available and remote Keep if necessary
** cons: the semantics of "cached" collection records need to be defined
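
The push approach of option 3 could look roughly like the following, using the Arvados Python SDK plus @arv-copy@ for the actual transfer. This is a minimal sketch: the host, token, and cluster ID are placeholders, and @push_missing_dependencies@ is a hypothetical helper, not an existing arvados-cwl-runner function.

<pre><code class="python">
import subprocess
import arvados

# API client for the remote (secondary) cluster -- placeholder host and token.
remote = arvados.api('v1', host='api.remote.example.com',
                     token='remote-cluster-token')

def push_missing_dependencies(deps, dst_cluster='xyzzy'):
    """Copy each dependency collection (uuid, portable_data_hash) to the
    remote cluster unless identical content is already present there."""
    for uuid, pdh in deps:
        existing = remote.collections().list(
            filters=[['portable_data_hash', '=', pdh]], limit=1).execute()
        if existing['items']:
            continue  # same content already on the remote cluster; skip it
        # arv-copy copies the collection record and the underlying Keep
        # blocks; it expects credentials for both clusters in ~/.config/arvados/.
        subprocess.run(['arv-copy', '--dst', dst_cluster, uuid], check=True)
</code></pre>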

Note that options 1 and 2 don't support workflows that perform some local computation and then pass intermediate results to a remote cluster for further computation.
43 | |||
44 | Options 2, 3 and 4 probably involve a similar level of effort, mainly involving arvados-cwl-runner. |
||
45 | |||
46 | Option 5 involves adding federation features to the Python/Go SDKs, arv-mount, and crunch-run. It also may require a new caching proxy service for keep to avoid redundantly transferring the same blocks over the internet. |
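
From the client side, federated access in option 5 might behave like the sketch below: ask each cluster in turn for the collection by portable data hash, and read blocks through whichever cluster has it. This illustrates the intended behavior, not an existing SDK feature; @fetch_collection@ and the client list are hypothetical.

<pre><code class="python">
import arvados
import arvados.collection

def fetch_collection(pdh, api_clients):
    """Find a collection by portable data hash, trying the local cluster
    first and then each federated peer."""
    for api in api_clients:
        match = api.collections().list(
            filters=[['portable_data_hash', '=', pdh]], limit=1).execute()
        if match['items']:
            # Read the collection via this cluster; data blocks come from
            # its Keep services (this is where a caching proxy would help).
            return arvados.collection.CollectionReader(pdh, api_client=api)
    raise LookupError("collection %s not found in the federation" % pdh)
</code></pre>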

Option 6 requires the combined work of options 4 and 5.

h3. Outputs

Finally, after a subworkflow runs on a remote cluster, the primary cluster needs to access its output and possibly run additional steps. This requires access to a single output collection, either by pulling it to the primary cluster (using the same features that support option 4) or by federation (options 5 and 6).
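
Retrieving the output with the pull approach could look like this sketch: look up the finished container request on the remote cluster and copy its output collection home. The UUIDs and the @home@ cluster alias are placeholders.

<pre><code class="python">
import subprocess
import arvados

remote = arvados.api('v1', host='api.remote.example.com',
                     token='remote-cluster-token')

# Container request that ran the subworkflow on the remote cluster
# (placeholder UUID).
cr = remote.container_requests().get(
    uuid='xyzzy-xvhdp-0123456789abcde').execute()

# Copy the output collection back to the primary cluster so later workflow
# steps can consume it ('home' stands for the primary cluster's arv-copy
# configuration).
subprocess.run(['arv-copy', '--dst', 'home', cr['output_uuid']], check=True)
</code></pre>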