Distributed workflows » History » Version 2

Peter Amstutz, 03/07/2018 11:19 PM

h1. Distributed workflows

h2. Problem description

A user wants to run a meta-analysis on data located on several different clusters.  For either efficiency or legal reasons, the data should be analyzed in place and the results aggregated and returned to a central location.  The user should be able to express the multi-cluster computation as a single CWL workflow, and no manual intervention should be required while the workflow is running.

h2. Simplifying assumptions

The user explicitly indicates in the workflow on which cluster a given computation (data + code) happens.

Data transfer only occurs between the primary cluster and the secondary clusters, not between secondary clusters.

h2. Proposed solution

A workflow step can be given a CWL hint "RunOnCluster".  This indicates that the tool or subworkflow run by the workflow step should run on a specific Arvados cluster, instead of being submitted to the cluster that the workflow runner is currently running on.  The implementation would be similar to the "RunInSingleContainer" feature: it constructs a container request that runs the workflow runner on the remote cluster, then waits for the results.
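
The dispatch decision described above can be sketched in a few lines of Python.  This is only an illustration, not the arvados-cwl-runner implementation: the hint field name @cluster_id@, the cluster IDs, and both function names are assumptions made for the example.

```python
# Hypothetical sketch: the "cluster_id" field, the cluster IDs, and these
# function names are assumptions, not an existing arvados-cwl-runner API.

def target_cluster(step, local_cluster):
    """Return the cluster a workflow step should run on."""
    for hint in step.get("hints", []):
        if hint.get("class") == "RunOnCluster":
            return hint["cluster_id"]
    # No hint: the step stays on the cluster where the runner is running.
    return local_cluster


def plan_submission(step, local_cluster):
    """Describe how the runner would dispatch one step."""
    cluster = target_cluster(step, local_cluster)
    if cluster == local_cluster:
        return {"cluster": cluster, "mode": "local"}
    # Remote case: as with RunInSingleContainer, the whole step is wrapped in
    # one request that runs a workflow runner on the remote cluster.
    return {"cluster": cluster, "mode": "remote-runner"}


hinted_step = {
    "run": "analyze.cwl",
    "hints": [{"class": "RunOnCluster", "cluster_id": "bbbbb"}],
}
plain_step = {"run": "aggregate.cwl"}

print(plan_submission(hinted_step, "aaaaa"))
print(plan_submission(plain_step, "aaaaa"))
```

Keeping the decision in one function means a step without the hint behaves exactly as it does today, so existing workflows would be unaffected.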

For the workflow to run successfully on the remote cluster, its data dependencies (Docker images, scripts, reference data, etc.) must be available there.  There are several options:

# Don't do any data transfer of dependencies.  Workflows will fail if dependencies are not available.  The user must manually transfer collections using arv-copy.
** pros: least work
** cons: terrible user experience.  Workflows that involve transferring intermediate results out to remote clusters don't work.
# Distribute dependencies as part of workflow registration (requires proactively distributing dependencies to every cluster that might ever need them).
** pros: less burden on the user than option (1)
** cons: doesn't guarantee the dependencies are available where needed; the --create-workflow and --update-workflow options of arvados-cwl-runner have to orchestrate the upload of data to every cluster in the federation
25
# Workflow runner determines which dependencies are missing from the remote cluster and pushes them before scheduling the subworkflow.
26
** pros: no user intervention required, only copy data to clusters that we think will need it
27
** cons: copies all dependencies regardless of whether they are actually used, requires that the primary runner have all the dependencies, or is able to facilitate transfer from some other cluster
# Workflow runner on the remote cluster determines which dependencies are missing and pulls them from federated peers on demand.
** pros: no user intervention required, only copies data we actually need
** cons: requires that the primary runner have all the dependencies or be able to facilitate transfer from some other cluster
# Federated access to collections: fetch data blocks on demand from another cluster.
** pros: only fetches data blocks that are actually needed, no collection record in the database
** cons: requires federation infrastructure that isn't designed yet; requires some sort of caching proxy to avoid re-fetching the same block (for example, if 100 nodes are all trying to run a Docker image from a federated collection)

Options 2, 3 and 4 require roughly equivalent development work.
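
The difference between options 3 and 4 comes down to which set of dependencies is compared against what the remote cluster already holds.  A minimal sketch, assuming dependencies can be named by simple labels; the labels and the function name are placeholders, and the queries that would discover what the remote cluster holds are not shown:

```python
# Hypothetical sketch: dependency names are placeholder labels, and the set of
# items already present on the remote cluster would come from API queries
# that are not shown here.

def missing_dependencies(needed, present_on_remote):
    """Return the dependencies that must be copied, in a stable order."""
    return sorted(set(needed) - set(present_on_remote))


# Everything the whole workflow declares (what the primary runner sees).
declared = {"docker-image", "scripts", "reference-genome", "unused-panel"}
# What the subworkflow actually reads at run time on the remote cluster.
used_remotely = {"docker-image", "scripts", "reference-genome"}
# What the remote cluster already holds.
present = {"docker-image"}

push_plan = missing_dependencies(declared, present)       # option 3: push up front
pull_plan = missing_dependencies(used_remotely, present)  # option 4: pull on demand

print(push_plan)  # ['reference-genome', 'scripts', 'unused-panel']
print(pull_plan)  # ['reference-genome', 'scripts']
```

In this example the push plan includes an item that is never read, which is the cost listed under option 3; the pull plan avoids it but can only be computed on the remote side.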