Idea #3015

closed

Make gatk3 pipeline template

Added by Tom Clegg almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 06/24/2014
Due date:
Story points: 3.0

Description

  • Create a project
  • Download and add the appropriate reference & example datasets to the project
  • Make a docker image with all of the relevant redistributable software pre-installed
  • Make a "pirs" crunch script and use it to generate the simulation dataset based on hg19 chr1
  • Make a "Single sample SNV with bwa and gatk" pipeline with no parallel/asynchronous tasks
  • Time permitting, make another pipeline that splits the inputs as described below in order to get faster turnaround time when multiple nodes are available.

The attached script and the existing GATK exome pipeline should be helpful.

Notes:
  • Use FUSE mount for inputs
  • Use GATK3 (as in the attached script), not GATK2 (as in the existing pipeline)
  • Use a docker image with the redistributable tools pre-installed, assuming this makes things easier. Do not put GATK itself in the image; continue to pass its tarball as a job input.
  • Use the file-select script to get appropriate bits from the GATK bundle (which we should have an entire copy of in our project), rather than downloading individual files needed.
  • Existing pipeline provides clues (not necessarily all correct with latest tool versions) about which tools are capable of reading/writing pipes rather than regular files.

Notes about parallelizing:

We can split the FASTQ into as many chunks as we want. After mapping, however, we should merge the alignments from one sample into a single SAM/BAM file so that the reads stack up at each genome position. Then we split the SAM/BAM file again by chromosome, which gives roughly 24 or 25 BAM fragments, and all downstream steps can be applied to these chromosome-based fragments. Finally, probably after annotation, we merge the fragment files into one final file. To increase parallelism further, we can even split the BAM at positions that have very low or no coverage.
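
The scatter plan above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual crunch script: the function names `fastq_chunks` and `chromosome_names` are hypothetical, FASTQ records are assumed to span exactly four lines each, and the 25th fragment is assumed to be the mitochondrial contig (`chrM` in hg19 naming). The mapping, merging, and per-chromosome BAM splitting would still be done by the external tools.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_chunks(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines holding at most records_per_chunk records each.

    Assumption: every record is exactly four lines (no wrapped sequences).
    For paired-end data, chunking both files with the same records_per_chunk
    keeps mates in matching chunks.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * records_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4 != 0:
            raise ValueError("input ends in the middle of a FASTQ record")
        yield chunk


def chromosome_names(include_mito: bool = False) -> List[str]:
    """Return hg19-style names: 24 chromosomes, or 25 with chrM (assumed)."""
    names = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    if include_mito:
        names.append("chrM")
    return names


if __name__ == "__main__":
    # Five tiny made-up reads, used only to show the chunking.
    demo = []
    for i in range(5):
        demo += [f"@read{i}", "ACGT", "+", "IIII"]

    for n, chunk in enumerate(fastq_chunks(demo, records_per_chunk=2)):
        print(f"mapping chunk {n}: {len(chunk) // 4} reads")

    # After merging the mapped chunks, downstream steps scatter per chromosome.
    print(f"{len(chromosome_names())} per-chromosome fragments")
```

Each FASTQ chunk would be mapped independently, the resulting alignments merged per sample, and the merged BAM then split once per name returned by `chromosome_names`. Splitting further at low-coverage positions is not covered by this sketch.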


Files

Single_Sample_SNV_Pipeline.txt (6.82 KB), Tom Clegg, 06/18/2014 01:04 AM

Subtasks: 5 (0 open, 5 closed)

Task #3068: Write pipeline template, Resolved, Peter Amstutz, 06/24/2014
Task #3107: Run on AWS instance (4xphq/qr1hi), Resolved, Peter Amstutz, 06/24/2014
Task #3066: Build docker image, Resolved, Peter Amstutz, 06/24/2014
Task #3069: Run on test data, Resolved, Peter Amstutz, 06/26/2014
Task #3067: Write scripts for each stage, Resolved, Peter Amstutz, 06/24/2014

Related issues

Related to Arvados - Bug #3373: [Sample pipelines] Improve SNV pipeline to accept example exome fastq data (2 pairs of reads) as a single input collection. Resolved, Peter Amstutz, 07/30/2014