Story #10344

[Workbench] Import CWL workflow

Added by Tom Morris over 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date: 10/25/2016
Due date:
% Done: 0%
Estimated time:
Story points: -

Related issues

Related to Arvados - Story #13080: Provider users the ability to do entire workflow building process through the web interface (New)

History

#1 Updated by Peter Amstutz over 2 years ago

Possible approach:

  • Upload CWL files to collection via web interface.
  • User clicks on "register workflow" and gets a file picker
  • Workbench fetches collection into a temp directory and runs arvados-cwl-runner on the backend to register workflow.
  • Workflow record & container request creation remains the same.
  • Unclear how, or whether, to update the workflow record when the collection is updated. This creates a synchronization problem that is likely to confuse users.

Alternate solution:

  • Upload CWL files to collection via web interface.
  • User clicks on "register workflow" and gets a file picker
  • Workbench adds a link object pointing to the collection to indicate it stores a workflow.
  • The current "workflows" table becomes redundant and can be eliminated.
  • The "Run a workflow" picker queries for links with link_class "workflow"
  • Workbench fetches collection to generate input editing UI
  • No synchronization problem (user just updates collection)
  • Feature can easily be extended to support workflows stored in git repositories in the future
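
A minimal sketch of what the link record and the picker query in this alternate solution could look like. The link_class value "workflow" comes from the proposal above; storing the entry-point file in the link's name field and the helper names are assumptions made for illustration, not an agreed design.

```python
def workflow_link_body(collection_uuid, main_file, owner_uuid=None):
    """Build the request body for a link that marks a collection as a workflow.

    Hypothetical layout: head_uuid points at the collection holding the CWL
    files, and name records which file in it is the entry point.
    """
    link = {
        "link_class": "workflow",      # class name taken from the proposal
        "name": main_file,             # assumption: entry-point path goes here
        "head_uuid": collection_uuid,  # collection that stores the CWL files
    }
    if owner_uuid is not None:
        link["owner_uuid"] = owner_uuid
    return {"link": link}


def workflow_link_filters(owner_uuid=None):
    """Build the filter list the workflow picker would send when listing links."""
    filters = [["link_class", "=", "workflow"]]
    if owner_uuid is not None:
        filters.append(["owner_uuid", "=", owner_uuid])
    return filters
```

With the Python SDK, the body would presumably be passed to links().create(body=...) and the filters to links().list(filters=...). Because the link only points at the collection, updating the collection needs no second write, which is the "no synchronization problem" point above.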

#2 Updated by Tom Clegg over 2 years ago

Another possibility:
  • Click "create workflow" button (on workflows#index or ...)
  • Choose files from local filesystem
  • Build workflow object (in client-side JS)
  • Do workflows#create call directly to API server
  • Refresh page so the new workflow shows up

(How much of arvados-cwl-runner needs to be ported to JS in order to make this happen?)

#3 Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

Another possibility:
  • Click "create workflow" button (on workflows#index or ...)
  • Choose files from local filesystem
  • Build workflow object (in client-side JS)
  • Do workflows#create call directly to API server
  • Refresh page so the new workflow shows up

(How much of arvados-cwl-runner needs to be ported to JS in order to make this happen?)

arvworkflow.upload_workflow:

  • Finds referenced Docker images and uploads them (impossible from browser)
  • Traverses document dependencies and packs them into a single document (would need to port dependency scanning and document packing)
    • Alternately, if we store the files in a collection, we don't have to do the packing, just the scanning
  • The document can have non-CWL dependencies (e.g. Python scripts used by the workflow). These also have to be uploaded to a collection, and the references to them in the CWL file updated.
    • Alternately, if we store the files in a collection, we just have to ensure that relative references are maintained.

So, this strategy is more viable under the "alternate solution" case where we store the workflow files as-is in a collection instead of storing compound documents in the 'workflows' table. This would be a better UX than requiring the user to select each file separately. However, we also need to examine browser security policy around accessing the file system.
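
To give a rough sense of the "just the scanning" part, here is a sketch of dependency scanning over an already-parsed CWL document (nested dicts and lists). It is not the arvados-cwl-runner implementation. It assumes references appear as string values of run, $import and $include, or as the location of a File literal, and it ignores anything with a URL scheme, an absolute path, or a bare "#fragment".

```python
from urllib.parse import urlsplit

# Keys whose string values name another document or file (assumed set).
REF_KEYS = ("run", "$import", "$include")


def find_references(node):
    """Yield every string reference found anywhere in a parsed CWL document."""
    if isinstance(node, dict):
        # File literals (e.g. default input values) carry a location.
        if node.get("class") == "File" and isinstance(node.get("location"), str):
            yield node["location"]
        for key, value in node.items():
            if key in REF_KEYS and isinstance(value, str):
                yield value
            else:
                # Covers embedded tools under "run" and all other nesting.
                yield from find_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_references(item)


def relative_references(doc):
    """Return the sorted, de-duplicated relative paths a document depends on."""
    refs = set()
    for ref in find_references(doc):
        parts = urlsplit(ref)
        if parts.scheme or ref.startswith(("/", "#")):
            continue  # remote, absolute, or same-document reference
        refs.add(parts.path)  # drop any "#fragment" suffix
    return sorted(refs)
```

If the files are stored as-is in a collection, the result is simply the list of extra files that must land in the collection at the same relative paths. A full scan would repeat this for each referenced CWL file.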

#4 Updated by Peter Amstutz over 2 years ago

Here's another idea. CWL files are placed in git and discovered automatically.

  • Gitolite post-update hook
  • Scan updated branch for Dockerfile
  • docker build
  • Scan repo for CWLFile, Dockstore.cwl
  • Create or update(?) workflow records for each one with arvados-cwl-runner --create-workflow
  • use link to connect repo+branch/tag with workflow record

Benefits:

  • Provides version tracking and provenance from the CWL and Docker files to the workflow
  • Best user experience (work locally, push to git, workflow automatically updates)
  • Can already view git repositories in workbench
  • Does not require any workbench changes
  • Can use repository layout/conventions that are compatible with Dockstore, make it easier for users to publish their dockerfiles/workflows

Considerations:

  • Where does the registration service run? (Is it a subprocess forked from gitolite, or a separate service?)
  • How to return messages/errors to the user
  • Assumes user ability to use git
  • Must be documented (but shouldn't be very hard, could add explicit links to documentation from workbench)
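
As a sketch of the hook side of this idea: git passes a post-update hook the names of the refs that were updated as command-line arguments, so the hook can work out which branches to rescan. The importer path and its "repository branch" argument order are assumptions borrowed from the proof-of-concept usage later in this thread; the helper names are hypothetical.

```python
def updated_branches(ref_names):
    """Return branch names for the updated refs, skipping tags and other refs."""
    prefix = "refs/heads/"
    return [ref[len(prefix):] for ref in ref_names if ref.startswith(prefix)]


def importer_commands(repo, ref_names, importer="./workflowimporter.py"):
    """Build one importer command line (as an argv list) per updated branch."""
    return [[importer, repo, branch] for branch in updated_branches(ref_names)]
```

In a real hook, ref_names would be sys.argv[1:] and repo would come from the hook environment (gitolite is expected to provide the repository name in GL_REPO, but check that against your installation). Each command could then be run with subprocess, or handed to a separate service if the build is too slow to run inside the push.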

#5 Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

examine browser security policy around accessing the file system.

FileReader API lets us do this, provided the files have been selected by the user with an <input type=file> widget.

#6 Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

examine browser security policy around accessing the file system.

FileReader API lets us do this, provided the files have been selected by the user with an <input type=file> widget.

Right, but that means the user has to explicitly select each file via an input widget or drop target. The browser can't follow references to files the user didn't select, so dependency scanning doesn't really work.

#7 Updated by Peter Amstutz over 2 years ago

Proof of concept branch for auto build/import of Docker image and workflow @ 10344-import-workflow-from-git

#8 Updated by Peter Amstutz over 2 years ago

Behavior in 10344-import-workflow-from-git:

This is based on the behavior of Dockstore (dockstore.org)

  1. Clone repository
  2. For each branch in the repository:
  3. Search for Dockerfiles
  4. Build Dockerfiles and name them based on repository name + location in repository
  5. Search for CWL files named Dockstore.cwl or CWLFile
  6. Register them as workflows
  7. Create a link record to associate the repository + branch with the workflow record, so that the workflow can be updated instead of creating a new one each time.

Usage

$ ./workflowimporter.py briandoconnor/dockstore-tool-bamstats develop
Cloning into '/tmp/tmpURhlYi'...
done.
Already on 'develop'
Your branch is up-to-date with 'origin/develop'.
Sending build context to Docker daemon 202.4 MB
Step 1 : FROM ubuntu:14.04
 ---> f6e25e99cf98
Step 2 : MAINTAINER Brian OConnor <briandoconnor@gmail.com>
 ---> Using cache
 ---> 30d6edff33a7
Step 3 : USER root
 ---> Using cache
 ---> 0f90323c0162
Step 4 : RUN apt-get -m update && apt-get install -y wget unzip openjdk-7-jre zip
 ---> Using cache
 ---> 2e013e76386c
Step 5 : RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip
 ---> Using cache
 ---> d23414a4c725
Step 6 : RUN unzip BAMStats-1.25.zip &&     rm BAMStats-1.25.zip &&     mv BAMStats-1.25 /opt/
 ---> Using cache
 ---> 7c8ca1ebd48c
Step 7 : COPY bin/bamstats /usr/local/bin/
 ---> Using cache
 ---> f11d8fffbeac
Step 8 : RUN chmod a+x /usr/local/bin/bamstats
 ---> Using cache
 ---> 02cf4f6b9c5a
Step 9 : RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 -m ubuntu
 ---> Using cache
 ---> 4290f3727457
Step 10 : USER ubuntu
 ---> Using cache
 ---> 9b10c8810afc
Step 11 : CMD /bin/bash
 ---> Using cache
 ---> 393fb89a2ac7
Successfully built 393fb89a2ac7
962eh-4zz18-lu552la35bicizx
Updated workflow 962eh-7fd4e-gkbzl62qqtfig37

This makes the user experience pretty easy:

  1. Write Dockerfile
  2. Write Dockstore.cwl
  3. Add to git and push
  4. Docker image and workflow appear in workbench automatically (once this is implemented as a git hook or backend service)
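
For illustration, the search steps and the update-or-create decision from the behavior list above could look roughly like this. It is not the code in the 10344-import-workflow-from-git branch. The image naming scheme (repository name plus directory, lowercased) and the shape of known_workflows are assumptions.

```python
import os

# File names treated as "primary" CWL entry points (from the behavior list).
CWL_ENTRY_NAMES = ("Dockstore.cwl", "CWLFile")


def image_name(repo_name, rel_dir):
    """Hypothetical naming scheme: repository name plus location in the repo."""
    if rel_dir in (".", ""):
        return repo_name.lower()
    return (repo_name + "/" + rel_dir.replace(os.sep, "/")).lower()


def discover(checkout_dir, repo_name):
    """Walk a checked-out branch and collect Dockerfiles and CWL entry points.

    Returns (dockerfiles, workflows): a dict mapping each Dockerfile's relative
    path to the image name it would be built as, and a list of relative paths
    of the CWL files to register.
    """
    dockerfiles = {}
    workflows = []
    for dirpath, dirnames, filenames in os.walk(checkout_dir):
        # Skip git metadata and visit directories in a stable order.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        rel_dir = os.path.relpath(dirpath, checkout_dir)
        for filename in sorted(filenames):
            rel_path = os.path.normpath(os.path.join(rel_dir, filename))
            if filename == "Dockerfile":
                dockerfiles[rel_path] = image_name(repo_name, rel_dir)
            elif filename in CWL_ENTRY_NAMES:
                workflows.append(rel_path)
    return dockerfiles, workflows


def registration_action(known_workflows, repo_name, branch, cwl_path):
    """Decide whether to update an existing workflow record or create one.

    known_workflows is assumed to map (repo, branch, path) to a workflow UUID,
    built from the link records described in step 7.
    """
    key = (repo_name, branch, cwl_path)
    if key in known_workflows:
        return ("update", known_workflows[key])
    return ("create", None)
```

The lookup keyed on repository, branch and path is what lets a second push report "Updated workflow ..." as in the usage output, instead of registering a duplicate.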

#9 Updated by Bryan Cosca over 2 years ago

sounds pretty cool! a few questions:

what's Dockstore.cwl and why would I need to write that?
Can this work without Dockerfiles? What if the image is already in keep? Will the Dockerfiles overwrite that image?

#10 Updated by Peter Amstutz over 2 years ago

Bryan Cosca wrote:

sounds pretty cool! a few questions:

what's Dockstore.cwl and why would I need to write that?

The idea is to just register "primary" CWL files under a specific name. Otherwise it would register every single tool in the repository.

You can have more than one "Dockstore.cwl" in a single repo, they would just need to go into separate directories.

The reason for naming it "Dockstore.cwl" is to be compatible with Dockstore:

https://dockstore.org/containers/quay.io/briandoconnor/dockstore-tool-bamstats

https://github.com/briandoconnor/dockstore-tool-bamstats

However we could also try to persuade the Dockstore developers to support a more generic name, like "CWLFile".

Can this work without Dockerfiles?

Yes. In that case the image would need to already be in Keep, or be pulled from somewhere else such as Docker Hub.

What if the image is already in keep? Will the Dockerfiles overwrite that image?

The idea is that if you provide a Dockerfile, docker build runs every time you push your branch. If nothing has changed, the build reuses cached layers and produces the same image, so nothing is updated. If the image has changed, the stored image is updated. The goal is to make your work as a bioinformatician easier by automating the currently somewhat manual steps of managing the Docker image and Workflow record.

I'd like for this to become a new service that Arvados provides, but the script is written such that you could start using it right now.

#11 Updated by Tom Morris over 1 year ago

  • Target version set to Arvados Future Sprints

#13 Updated by Tom Morris over 1 year ago

  • Related to Story #13080: Provider users the ability to do entire workflow building process through the web interface added
