Running Platypus using Arvados

This tutorial demonstrates how to call variants from high-throughput sequencing data using Platypus. Platypus is a research project of the Wellcome Trust Centre for Human Genetics. The Platypus publication is: Andy Rimmer, Hang Phan, Iain Mathieson, Zamin Iqbal, Stephen R. F. Twigg, WGS500 Consortium, Andrew O. M. Wilkie, Gil McVean, Gerton Lunter. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature Genetics.

This tutorial introduces the following Arvados features:

  • How to run Platypus using Arvados
  • How to access your pipeline results.
  • How to browse and select your input data for Platypus and re-run the pipeline.

  1. Start at the Curoverse website and click Log In at the top. We currently support all Google / Google Apps accounts for authentication. When you choose a Google-based account, your Arvados account is created automatically and you are redirected to the Arvados Workbench.
  2. In the Active pipelines panel, click on the Run a pipeline... button. Doing so opens a dialog box titled Choose a pipeline to run.
  3. Select Platypus (public) and click the Next: choose inputs button. Doing so loads a new page to supply the inputs for the pipeline.
  4. The default inputs from the Platypus source code repository are already pre-loaded. Click on the Run button. The page updates to show you that the pipeline has been submitted to run on the Arvados cluster.
  5. After the pipeline starts running, you can track its progress by watching log messages from jobs. This page refreshes automatically. You will see a complete label in the job column when the pipeline completes successfully. The current run time of the job in CPU and clock hours is also displayed. You can view individual job details by clicking on the job name.
  6. Once the job is finished, the output can be viewed to the right of the run time.
  7. Click on the download button to the right of the file to download your results, or the magnifying glass to quickly view your results.

Uploading data through the web and using it on Arvados

  1. In your home project, click on the blue + Add data button in the top right.
  2. Click Upload files from my computer.
  3. Click Choose Files and choose the paired-end FASTQ files you would like to run the pipeline on.
  4. Once you're ready, click > Start.
  5. Feel free to rename your collection so you can find it later. Click on the pencil icon in the top left corner next to New collection.
  6. Once that is complete, navigate back to the dashboard and click on Run a pipeline... and choose Platypus (Public).
  7. You can change the input by clicking on the [Choose] button next to the Input raw fastq file collection.
  8. Click on the dropdown menu, click on Home, and choose your desired input collection. Click OK and then Run to run the Platypus pipeline.

Uploading data through your shell and using it on Arvados

Full documentation can be found here

  1. Install the Arvados Python SDK on the system from which you will upload the data (such as your workstation, or a server containing data from your sequencer). Doing so will install the Arvados file upload tool, arv-put.
  2. To configure the environment with the Arvados instance host name and authentication token, see here
  3. Navigate back to your Workbench dashboard and create a new project by clicking on the Projects dropdown menu and clicking Home.
  4. Click on [+ Add a subproject]. Feel free to edit the Project name or description by clicking the pencil to the right of the text.
  5. To add data, return to your shell, create a folder, and put the VCF files you want to upload inside. From inside that folder, run the command arv-put * --project-uuid qr1hi-xxxxx-yyyyyyyyyyyyyyy, replacing the placeholder with the UUID of your new project. The UUID, which begins with the qr1hi tag, can be found in the URL of your new project. Uploading everything with a single command ensures that all the files end up in one collection.
  6. The output value xxxxxxxxxxxxxxxxxxxx+yyyy is the Arvados collection locator that uniquely identifies the uploaded collection.
  7. Once that is complete, navigate back to the dashboard and click on Run a pipeline... and choose Platypus (Public).
  8. You can change the input by clicking on the [Choose] button next to the Input raw fastq file collection.
  9. Click on the dropdown menu, click on Home, and choose your desired input collection. Click OK and then Run to run the Platypus pipeline.
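
The shell upload steps above can be sketched as a small Python helper that checks the environment and assembles the arv-put command before you run it. This is only an illustration under stated assumptions, not part of Arvados: the function names (missing_settings, build_arv_put_command, classify_output) are hypothetical, the environment variable names ARVADOS_API_HOST and ARVADOS_API_TOKEN are assumed to be the ones your configuration step sets, and the two regular expressions are assumed shapes for a project UUID and a collection locator. The --name flag of arv-put is optional and is used here only to label the collection.

```python
import os
import re
from pathlib import Path

# Assumed shape of an Arvados UUID: 5-character cluster prefix (for example
# qr1hi), 5-character type field, 15-character identifier.
UUID_RE = re.compile(r"^[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{15}$")
# Assumed shape of a collection locator: 32 hex digits, "+", size in bytes.
LOCATOR_RE = re.compile(r"^[0-9a-f]{32}\+[0-9]+$")


def missing_settings(env=None):
    """Return the names of the connection settings that are unset or empty."""
    env = os.environ if env is None else env
    required = ("ARVADOS_API_HOST", "ARVADOS_API_TOKEN")
    return [name for name in required if not env.get(name)]


def build_arv_put_command(folder, project_uuid, name=None):
    """Build the argument list for one arv-put call covering a whole folder.

    Passing every file to a single arv-put call keeps them in one collection.
    Like the shell glob `*`, hidden files are skipped; unlike the glob,
    subdirectories are skipped as well.
    """
    if not UUID_RE.match(project_uuid):
        raise ValueError(f"unexpected project UUID format: {project_uuid!r}")
    files = sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and not path.name.startswith(".")
    )
    if not files:
        raise ValueError(f"no files to upload in {folder!r}")
    command = ["arv-put", "--project-uuid", project_uuid]
    if name:
        command += ["--name", name]
    return command + files


def classify_output(line):
    """Guess what kind of identifier a line of arv-put output contains."""
    value = line.strip()
    if LOCATOR_RE.match(value):
        return "collection locator"
    if UUID_RE.match(value):
        return "uuid"
    return "unknown"
```

To run the upload, you could pass the returned list to subprocess.run(command, check=True, capture_output=True, text=True) and feed the last line of its output to classify_output. Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.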

FAQ

WIP