Distributed workflows » History » Version 7

Tom Clegg, 03/08/2018 07:32 PM

h1. Distributed workflows

h2. Problem description

A user wants to run a meta-analysis on data located on several different clusters.  For either efficiency or legal reasons, the data should be analyzed in place and the results aggregated and returned to a central location.  The user should be able to express the multi-cluster computation as a single CWL workflow, and no manual intervention (data transfer, etc.) should be required to complete the workflow.

h2. Simplifying assumptions

The user explicitly indicates in the workflow on which cluster each computation (data+code) happens.

Data transfer only occurs between the primary cluster and the secondary clusters, not between secondary clusters.

h2. Proposed solution

h3. Run subworkflow on cluster

A workflow step can be given a CWL hint "RunOnCluster".  This indicates that the tool or subworkflow run by the workflow step should run on a specific Arvados cluster, instead of being submitted to the cluster the workflow runner is currently running on.  The implementation would be similar to the "RunInSingleContainer" feature: construct a container request that runs the workflow runner on the remote cluster and waits for results.
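As an illustration, the hint might appear in a workflow step as follows.  The hint class name "RunOnCluster" comes from this proposal; the "arv:" namespace prefix and the "clusterID" field are assumptions, since the hint's actual schema is not yet defined.

<pre><code class="yaml">
steps:
  remote_analysis:
    run: subworkflow.cwl          # tool or subworkflow to run remotely
    in: {reads: reads}
    out: [result]
    hints:
      - class: arv:RunOnCluster   # proposed hint; field names are illustrative
        clusterID: clsr2          # hypothetical target Arvados cluster ID
</code></pre>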
h3. Data transfer

In order for the workflow to run successfully on the remote cluster, it needs its data dependencies (docker images, scripts, reference data, etc.).  There are several options:

# Don't do any data transfer of dependencies.  Workflows will fail if dependencies are not available.  The user must manually transfer collections using arv-copy.
** pros: least work
** cons: terrible user experience.  Workflow patterns that transfer data out to remote clusters don't work.
# Distribute dependencies as part of workflow registration (requires proactively distributing dependencies to every cluster that might ever need them).
** pros: less burden on the user compared to option (1)
** cons: doesn't guarantee the dependencies are available where needed; the --create/update-workflow option of arvados-cwl-runner has to orchestrate upload of data to every cluster in the federation.  Workflow patterns that transfer data out to remote clusters don't work.
# The workflow runner determines which dependencies are missing from the remote cluster and pushes them before scheduling the subworkflow.
** pros: no user intervention required; only copies data to clusters that we think will need it
** cons: copies all dependencies regardless of whether they are actually used; requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# The workflow runner on the remote cluster determines which dependencies are missing and pulls them from federated peers on demand.
** pros: no user intervention required; only copies data that is actually needed
** cons: requires that the primary runner have all the dependencies, or be able to facilitate transfer from some other cluster
# Federated access to collections: fetch data blocks on demand from another cluster.
** pros: only fetches data blocks that are actually needed; no additional copies of collections (although, like user records, they may be cached on the remote cluster)
** cons: requires building out federation support on the API server; requires provenance improvements (supply both UUID and PDH when submitting container requests); inefficient when substantial remote data is read many times (e.g., docker images) unless some sort of cross-cluster data caching is also added
# Hybrid federation: copy a collection to the remote cluster but retain the UUID/permissions from the source.
** pros: no user intervention; only fetch the blocks we need; fetch data blocks from local keep if available, remote keep if necessary
** cons: the semantics/permission model for "cached" collection records is not yet defined
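The dependency-transfer planning in options 3 and 4 can be sketched as a pure function: given the portable data hashes (PDHs) the subworkflow depends on and what each cluster already stores, decide which collections to copy and from where.  This is a minimal sketch under stated assumptions; in a real implementation the per-cluster contents would come from the Arvados API and the copying would be done with arv-copy or equivalent, and all cluster IDs and PDHs below are made up.

<pre><code class="python">
def plan_transfers(required_pdhs, cluster_contents, dst):
    """Decide which dependency collections must be copied to cluster `dst`.

    required_pdhs    -- PDHs the subworkflow depends on
    cluster_contents -- mapping of cluster ID -> set of PDHs stored there
    dst              -- cluster ID the subworkflow will run on

    Returns (transfers, unresolved): `transfers` maps each missing PDH to a
    source cluster that has it; `unresolved` holds PDHs no cluster can supply.
    """
    have = cluster_contents.get(dst, set())
    transfers, unresolved = {}, set()
    for pdh in required_pdhs:
        if pdh in have:
            continue  # already present on the destination; nothing to copy
        src = next((c for c, pdhs in sorted(cluster_contents.items())
                    if c != dst and pdh in pdhs), None)
        if src is None:
            unresolved.add(pdh)   # no cluster can facilitate the transfer
        else:
            transfers[pdh] = src
    return transfers, unresolved

# Hypothetical example: the docker image is already on the destination,
# so only the reference data must be pushed from the primary cluster.
contents = {
    "home1": {"d41d8+0", "ref99+210"},   # primary cluster (made-up PDHs)
    "clsr2": {"d41d8+0"},                # destination cluster
}
print(plan_transfers({"d41d8+0", "ref99+210"}, contents, "clsr2"))
# ({'ref99+210': 'home1'}, set())
</code></pre>

Option 3 would run this plan on the primary cluster before scheduling the subworkflow; option 4 would run the same logic on the remote cluster and pull rather than push.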
h4. Notes

Options 1 and 2 cannot support workflows that involve some local computation and then pass intermediate results to a remote cluster for further computation.

Options 2, 3 and 4 involve a similar level of effort, mainly in arvados-cwl-runner.  Of these, option 4 seems to cover the most use cases.  A general "transfer required collections" method would cover data transfer for dependencies, intermediate collections, and outputs.

Option 5 involves adding federation-awareness to the Python/Go SDKs, arv-mount, crunch-run, and the API server.  In order to work efficiently when clusters are distant and large remote collections are accessed more than once per workflow, it will need an infrastructure-level cache solution.

Option 6's level of effort is probably somewhere between options 4 and 5.

h3. Outputs

Finally, after a subworkflow runs on a remote cluster, the primary cluster needs to access the output and possibly run additional steps.  This requires accessing a single output collection, either by pulling it to the primary cluster (using the same features supporting option 4) or by federation (options 5 and 6).
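The pull-back step can be sketched the same way.  A minimal sketch, assuming the finished remote container request record is available as a dict; the field names below loosely mirror Arvados container request records but are illustrative, not the actual SDK.

<pre><code class="python">
def output_pull_plan(container_request, local_pdhs):
    """Return the output collection PDH to pull back to the primary
    cluster, or None if it is already available locally.

    container_request -- dict shaped loosely like an Arvados container
                         request record (field names are illustrative)
    local_pdhs        -- set of collection PDHs stored on the primary cluster
    """
    if container_request.get("state") != "Final":
        raise ValueError("remote subworkflow has not finished yet")
    pdh = container_request["output_pdh"]   # hypothetical field name
    return None if pdh in local_pdhs else pdh

# Hypothetical example: the remote run finished and its output collection
# is not yet stored on the primary cluster, so it must be pulled.
cr = {"state": "Final", "output_pdh": "out42+128"}
print(output_pull_plan(cr, {"d41d8+0"}))   # prints out42+128
</code></pre>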