Peter Amstutz, 03/07/2018 11:19 PM
h1. Distributed workflows

h2. Problem description

A user wants to run a meta-analysis on data located on several different clusters. For either efficiency or legal reasons, the data should be analyzed in place and the results aggregated and returned to a central location. The user should be able to express the multi-cluster computation as a single CWL workflow, and no manual intervention should be required while the workflow is running.

h2. Simplifying assumptions

The user explicitly indicates in the workflow which cluster each computation (data + code) runs on.

Data transfer only occurs between the primary cluster and the secondary clusters, not between secondary clusters.

h2. Proposed solution

A workflow step can be given a CWL hint "RunOnCluster". This indicates that the tool or subworkflow run by the workflow step should run on a specific Arvados cluster, instead of being submitted to the cluster that the workflow runner is currently running on. The implementation would be similar to the "RunInSingleContainer" feature: construct a container request to run the workflow runner on the remote cluster, then wait for results.
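To illustrate how a runner might interpret such a hint, here is a minimal Python sketch that groups workflow steps by the cluster they should be submitted to. It is not the arvados-cwl-runner implementation. The hint's @cluster_id@ field, the helper names, and the cluster IDs are assumptions made for illustration, since the proposal does not define them.

```python
def target_cluster(step, local_cluster):
    """Return the cluster a step should run on.

    Looks for the proposed "RunOnCluster" hint. The "cluster_id" field
    is a hypothetical name; steps without the hint stay local.
    """
    for hint in step.get("hints", []):
        if hint.get("class") == "RunOnCluster":
            return hint.get("cluster_id", local_cluster)
    return local_cluster


def partition_steps(steps, local_cluster):
    """Group step ids by the cluster they should be submitted to."""
    by_cluster = {}
    for step in steps:
        cluster = target_cluster(step, local_cluster)
        by_cluster.setdefault(cluster, []).append(step["id"])
    return by_cluster


# Hypothetical workflow: two per-site analyses plus a central aggregation.
steps = [
    {"id": "analyze_site_a",
     "hints": [{"class": "RunOnCluster", "cluster_id": "clsr1"}]},
    {"id": "analyze_site_b",
     "hints": [{"class": "RunOnCluster", "cluster_id": "clsr2"}]},
    {"id": "aggregate"},
]

plan = partition_steps(steps, local_cluster="clsr0")
```

In this sketch the two analysis steps are assigned to their remote clusters, while the aggregation step, which carries no hint, stays on the primary cluster.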

In order for the workflow to run successfully on the remote cluster, it needs its data dependencies (Docker images, scripts, reference data, etc.). There are several options:

# Don't do any data transfer of dependencies. Workflows will fail if dependencies are not available. The user must manually transfer collections using arv-copy.
** pros: least work
** cons: terrible user experience; workflows that involve transferring intermediate results out to remote clusters don't work.
# Distribute dependencies as part of workflow registration (requires proactively distributing dependencies to every cluster that might ever need them).
** pros: less burden on the user than option (1)
** cons: doesn't guarantee the dependencies are available where needed; the --create-workflow and --update-workflow options of arvados-cwl-runner have to orchestrate upload of data to every cluster in the federation
# Workflow runner determines which dependencies are missing from the remote cluster and pushes them before scheduling the subworkflow.
** pros: no user intervention required; data is copied only to clusters that are expected to need it
** cons: copies all dependencies regardless of whether they are actually used; requires that the primary runner either have all the dependencies or be able to facilitate transfer from some other cluster
# Workflow runner on the remote cluster determines which dependencies are missing and pulls them from federated peers on demand.
** pros: no user intervention required; only data that is actually needed is copied
** cons: requires that the primary runner either have all the dependencies or be able to facilitate transfer from some other cluster
# Federated access to collections: fetch data blocks on demand from another cluster.
** pros: only fetches data blocks that are actually needed; no collection record in the database
** cons: requires federation infrastructure that isn't designed yet; requires some sort of caching proxy to avoid re-fetching the same block (for example, if 100 nodes are all trying to run a Docker image from a federated collection).

Options 2, 3 and 4 require roughly equivalent development work.
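The missing-dependency check described in the list above (done by the primary runner before scheduling, or by the remote runner on demand) comes down to a set difference per cluster. The following Python sketch shows that step only. The function name, the dependency labels, and the cluster IDs are hypothetical, and how each cluster's existing contents would actually be queried is left out.

```python
def plan_transfers(required_by_cluster, present_by_cluster):
    """Map each remote cluster to the sorted list of dependencies it lacks.

    Both arguments map a cluster id to a set of dependency identifiers.
    Clusters that already hold everything are omitted from the result.
    """
    plan = {}
    for cluster, required in required_by_cluster.items():
        present = present_by_cluster.get(cluster, set())
        missing = set(required) - set(present)
        if missing:
            plan[cluster] = sorted(missing)
    return plan


# Hypothetical inventory: what each subworkflow needs vs. what is there.
required_by_cluster = {
    "clsr1": {"docker-image", "reference-genome", "scripts"},
    "clsr2": {"docker-image", "scripts"},
}
present_by_cluster = {
    "clsr1": {"docker-image"},
    "clsr2": {"docker-image", "scripts"},
}

transfers = plan_transfers(required_by_cluster, present_by_cluster)
```

Here only the first cluster needs anything copied. Whether the copy is a push from the primary cluster or a pull by the remote runner is the design choice that separates the two options.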