Distributed workflows » History » Version 4

Peter Amstutz, 03/08/2018 03:07 PM

h1. Distributed workflows

h2. Problem description

A user wants to run a meta-analysis on data located on several different clusters.  For either efficiency or legal reasons, the data should be analyzed in place and the results aggregated and returned to a central location.  The user should be able to express the multi-cluster computation as a single CWL workflow, and no manual intervention should be required while the workflow is running.

h2. Simplifying assumptions

The user explicitly indicates in the workflow which cluster each computation (data + code) runs on.

Data transfer only occurs between the primary cluster and the secondary clusters, not between secondary clusters.

h2. Proposed solution

h3. Run subworkflow on cluster

A workflow step can be given a CWL hint, "RunOnCluster".  This indicates that the tool or subworkflow run by the step should execute on a specific Arvados cluster, instead of being submitted to the cluster where the workflow runner is currently running.  The implementation would be similar to the "RunInSingleContainer" feature: construct a container request that runs the workflow runner on the remote cluster and wait for the results.
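A hedged sketch of what such a hint might look like in a workflow document (the hint name comes from this proposal; the @arv:@ namespace URI, the @clusterID@ field, and the cluster ID @jutro@ are illustrative assumptions, not a settled interface):

<pre><code class="yaml">
cwlVersion: v1.0
class: Workflow
$namespaces:
  arv: "http://arvados.org/cwl#"
inputs:
  samples: Directory
outputs:
  summary:
    type: File
    outputSource: analyze_remote/summary
steps:
  analyze_remote:
    run: analyze.cwl          # subworkflow to execute near the data
    hints:
      arv:RunOnCluster:       # proposed hint; field name is illustrative
        clusterID: jutro      # hypothetical remote Arvados cluster
    in:
      samples: samples
    out: [summary]
</code></pre>

Steps without the hint would run on the primary cluster as usual; the hint only redirects the step it is attached to.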

h3. Data transfer

In order for the workflow to run successfully on the remote cluster, it needs its data dependencies (Docker images, scripts, reference data, etc.).  There are several options:

# Don't do any data transfer of dependencies.  Workflows will fail if dependencies are not available.  User must manually transfer collections using arv-copy.
** pros: least work
** cons: terrible user experience.  Workflow patterns that transfer data out to remote clusters don't work.
# Distribute dependencies as part of workflow registration (requires proactively distributing dependencies to every cluster that might ever need them).
** pros: less burden on user compared to option (1)
** cons: doesn't guarantee the dependencies are available where needed; the --create/update-workflow option of arvados-cwl-runner has to orchestrate uploading data to every cluster in the federation.  Workflow patterns that transfer data out to remote clusters don't work.
# Workflow runner determines which dependencies are missing from the remote cluster and pushes them before scheduling the subworkflow.
** pros: no user intervention required, only copy data to clusters that we think will need it
** cons: copies all dependencies regardless of whether they are actually used; requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# Workflow runner on remote cluster determines which dependencies are missing and pulls them from federated peers on demand.
** pros: no user intervention required, only copy data we actually need
** cons: requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# Federated access to collections, fetch data blocks on demand from another cluster
** pros: only fetch data blocks that are actually needed, no collection record copy in remote database
** cons: requires SDK improvements to handle multiple clusters, requires caching proxy to avoid re-fetching the same block (for example, if 100 nodes are all trying to run a docker image from a federated collection).
# Hybrid federation, copy a collection to remote cluster but retain UUID/permission from source
** pros: no user intervention, only fetch blocks we need, fetch data blocks from local keep if available, remote keep if necessary
** cons: semantics/permission model for "cached" collection records are not yet defined.
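Options 3 and 4 share a core step: computing which dependencies a cluster is missing.  Since Arvados collections are content-addressed by portable data hash (PDH), this reduces to a set difference over PDHs.  A minimal sketch (the function names are hypothetical; a real implementation would query each cluster through the Arvados SDK, and whether arv-copy can be handed these identifiers directly is an assumption of the sketch):

<pre><code class="python">
def missing_dependencies(required_pdhs, remote_pdhs):
    """Portable data hashes the subworkflow needs that the remote
    cluster does not already have."""
    return sorted(set(required_pdhs) - set(remote_pdhs))

def transfer_plan(required_pdhs, remote_pdhs, src, dst):
    """One arv-copy command per missing collection.  Commands are only
    built here, not executed; src/dst are cluster config names."""
    return [
        ["arv-copy", "--src", src, "--dst", dst, pdh]
        for pdh in missing_dependencies(required_pdhs, remote_pdhs)
    ]

# Example: two dependencies, one already present on the remote cluster.
required = [
    "99999999999999999999999999999999+99",   # e.g. a Docker image collection
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa+42",   # e.g. reference data
]
already_remote = ["aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa+42"]
plan = transfer_plan(required, already_remote, "pirca", "jutro")
</code></pre>

The difference between options 3 and 4 is only *where* this runs: on the primary runner before scheduling (push), or on the remote runner at startup (pull).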

h4. Notes

Options 1 and 2 cannot support workflows that perform some local computation and then pass intermediate results to a remote cluster for further computation.

Options 2, 3 and 4 require a similar level of effort, mostly in arvados-cwl-runner.  Of these, option 4 seems to cover the most use cases: a general "transfer required collections" method would cover data transfer for dependencies, intermediate collections, and outputs.

Option 5 involves adding federation features to the Python/Go SDKs, arv-mount, and crunch-run.  It may also require new Keep infrastructure, such as a caching proxy service, to avoid redundantly transferring the same blocks over the internet.

Option 6's level of effort probably falls somewhere between that of options 4 and 5.

h3. Outputs

Finally, after a subworkflow runs on a remote cluster, the primary cluster needs to access the output and possibly run additional steps.  This requires accessing a single output collection, either by pulling it to the primary cluster (using the same features supporting option 4), or by federation (options 5, 6).
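One concrete anchor here: Arvados object UUIDs begin with a five-character cluster ID, so the primary runner can tell from the output collection's UUID whether it is local or must come from elsewhere.  A sketch of that routing decision (function names are hypothetical; the "federate" branch stands in for options 5/6, and the arv-copy invocation mirrors the option-4 pull):

<pre><code class="python">
def cluster_of(uuid):
    """The first five characters of an Arvados UUID identify the cluster
    that owns the object (e.g. 'jutro-4zz18-...' -> 'jutro')."""
    return uuid.split("-")[0]

def output_fetch_action(output_uuid, home_cluster, federated=False):
    """Decide how the primary cluster gets at a remote subworkflow's
    single output collection."""
    src = cluster_of(output_uuid)
    if src == home_cluster:
        return ("local", output_uuid)      # already here; nothing to do
    if federated:
        return ("federate", output_uuid)   # read blocks remotely (options 5/6)
    # Pull it home using the same machinery as option 4.
    return ("pull", ["arv-copy", "--src", src, "--dst", home_cluster, output_uuid])
</code></pre>

Whichever branch is taken, downstream steps on the primary cluster see a single collection identifier and need no knowledge of where the computation happened.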