Task #4086

Support #4133: [Support] fix bugs and write tests (first half)

[Crunch] Not enough memory to run GATK

Added by Bryan Cosca over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assigned To: Brett Smith
Category: Crunch
Target version:
Start date: 10/03/2014
Due date:
% Done: 0%
Estimated time:

Description

I'm trying to parallelize the GATK haplotype caller to run on multiple cores using the -nct flag in jobs/qr1hi-8i9sb-zs7g4zo303dhdnu, but I run into this error: "There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java."

Looking it up, most people say to raise the -Xmx value to allocate more memory. 4g should be sufficient for the job, but maybe there are some underlying factors as well.


Related issues

Related to Arvados - Bug #4185: [Crunch] crunchstat memory reports seem suspect for multithreaded programs (Resolved, 10/14/2014)

History

#1 Updated by Ward Vandewege over 5 years ago

  • Target version set to Bug Triage

#2 Updated by Radhika Chippada over 5 years ago

  • Subject changed from Not enough memory to run GATK to [Crunch] Not enough memory to run GATK
  • Category set to Crunch

#3 Updated by Ward Vandewege over 5 years ago

  • Priority changed from Normal to High

#4 Updated by Brett Smith over 5 years ago

  • Target version changed from Bug Triage to 2014-10-29 sprint
  • Parent task set to #4133

TODO: Check the GATK switches to make sure it's not doing something funny like specifying "RAM per core/thread".

Another idea of Ward's is that GATK doesn't consider swap to be RAM. That's another possibility to investigate.

#5 Updated by Brett Smith over 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith

#6 Updated by Brett Smith over 5 years ago

Bryan,

There are lots of reports across the Web about the haplotype caller using very large amounts of RAM. Apparently it is very conservative in its coverage choices, which can cause RAM use to balloon quickly depending on the inputs. This GATK support thread seems to be the main focus of discussion about it.

A few notes and ideas that might help you get moving forward:

  • Have you run the haplotype caller successfully without parallelization? On the same inputs?
  • The compute nodes have 16GiB of physical RAM, and then 80GiB of swap backed by SSD storage. So there's basically no reason for you to say anything less than -Xmx15g (leaving some space for the OS, Docker, etc.), and you can go as high as -Xmx95g if you're okay with using the swap (which will be slower than RAM but shouldn't be backbreaking for performance).
  • Do you know if it's possible to turn on downsampling for this job? The lack of default downsampling seems to be the primary cause of ballooning RAM requirements, and turning it on seems to be a focus of that thread above. One developer summarizes the main suggestions as "minPruning and downsampling at the parameter level, and multithreading/scatter-gather parallelism at the execution level." If I'm following right, you're trying to do the last one now, so it seems worthwhile to explore more parameter adjustment.

Let me know if any of that's helpful.
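
As a rough sketch of the memory arithmetic in the second note above: the node figures (16GiB RAM, 80GiB swap) come from that note, while the 1GiB headroom and the helper name `xmx_flag` are assumptions chosen only to show how the two -Xmx values fall out.

```python
# Illustrative only. Node figures are from the note above; the 1 GiB
# headroom is an assumed reserve for the OS, Docker, etc.
PHYSICAL_RAM_GIB = 16
SWAP_GIB = 80
HEADROOM_GIB = 1


def xmx_flag(total_gib, headroom_gib=HEADROOM_GIB):
    """Return a -Xmx flag that leaves headroom_gib out of total_gib."""
    usable = total_gib - headroom_gib
    if usable <= 0:
        raise ValueError("headroom exceeds available memory")
    return "-Xmx{}g".format(usable)


# Heap ceiling that stays inside physical RAM.
ram_only = xmx_flag(PHYSICAL_RAM_GIB)
# Heap ceiling if spilling into swap is acceptable.
ram_plus_swap = xmx_flag(PHYSICAL_RAM_GIB + SWAP_GIB)
print(ram_only, ram_plus_swap)
```

The first value keeps the heap inside physical RAM; the second accepts swap use, with the slowdown that implies.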

#7 Updated by Bryan Cosca over 5 years ago

  • I have run haplotype caller successfully without parallelization, on the same inputs and on different inputs.
  • I'll experiment next with this -Xmx??g parameter to see what's the fastest.
  • You can change the downsampling parameter with -dcov, but there are some problems. Specifically, the developer of GATK recently stated (http://gatkforums.broadinstitute.org/discussion/4614/haplotypecaller-and-downsampling): "We have seen a couple of instances of odd behavior when dcov is used with HaplotypeCaller, which may be linked to the fact that HC does some downsampling on its own internally. This looks like it could be related to that behavior. Pending further investigation, our recommendation is to not change the dcov setting with HC." So it is possible but strongly discouraged.
  • Looking at minPruning, it seems like a tricky parameter to change. Looking around the forums, I think that generally if you increase minPruning, you reduce the accuracy of the results, but you do get faster results. It's a tricky trade-off because I know we're really into getting the most accurate results.
  • I've looked into multithreading/scatter-gather and it seems like GATK has their own software called Queue that manages this for you. I thought about putting it into a docker image but I think I ran into a few complications that I don't recall off the top of my head.
  • Essentially I created my own hack for scatter-gather, where I separate the genome into chromosomes and run haplotype caller on each of those and then merge all the results at the end.
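
A minimal sketch of what that per-chromosome scatter could look like, not the actual script: the file names, the chromosome list, and the helper names are placeholders, and the switches assume the GATK 3.x-style command line (-T, -R, -I, -L, -o). The gather step (merging the per-chromosome VCFs) is left out because the thread doesn't say which tool does it.

```python
# Hypothetical sketch of per-chromosome scatter. File names and the
# chromosome list are placeholders, not taken from the actual job.
def haplotype_caller_cmd(chrom, bam, reference, xmx="4g",
                         jar="GenomeAnalysisTK.jar"):
    """Build one HaplotypeCaller command restricted to a single chromosome."""
    return [
        "java", "-Xmx" + xmx, "-jar", jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", chrom,                          # restrict this run to one interval
        "-o", "calls.{}.vcf".format(chrom),   # one output file per chromosome
    ]


def scatter(chroms, bam, reference):
    """Return one command per chromosome; nothing is executed here."""
    return [haplotype_caller_cmd(c, bam, reference) for c in chroms]


chroms = ["chr1", "chr2", "chrY"]
commands = scatter(chroms, "sample.bam", "reference.fasta")
for cmd in commands:
    print(" ".join(cmd))
```

Chromosome names have to match the reference's own naming (for example "chr1" versus "1"), so the list above would need adjusting for a real run.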

#8 Updated by Brett Smith over 5 years ago

Thanks for the follow-ups. I didn't see the warning about tweaking dcov, so thanks for flagging that. And I understand the compromise of setting minPruning, so it makes sense for that to be an option of last resort. So I think the steps from here are:

  • Try with -Xmx15g.
  • If that fails, try with -Xmx95g.
  • If that fails, let's reconvene. One thing we might look at is tweaking the threading parameters, like maybe using -nt instead of/in addition to -nct, or experimenting with different numbers there.

#9 Updated by Bryan Cosca over 5 years ago

Just some more follow up:

  • I've tried tweaking -nt and setting the job's minimum number of nodes to more than 1 (specifically 5-20). That did not change anything, and it still ran on one node. I'm not sure whether that parameter will change much (this may be due to my lack of deep understanding of threads and parallelization).

#10 Updated by Brett Smith over 5 years ago

Bryan Cosca wrote:

  • I've tried tweaking -nt and setting the job's minimum number of nodes to more than 1 (specifically 5-20). That did not change anything, and it still ran on one node. I'm not sure whether that parameter will change much (this may be due to my lack of deep understanding of threads and parallelization).

I do understand threads, and it's not immediately obvious to me how the different threading-related switches affect GATK's behavior. No point trying stuff you've already tried that doesn't do what we need.

Do you have a UUID for the single-threaded job that ran on these inputs? crunchstat will tell us how much RAM the single-threaded version took. If we can get a sense of how GATK's multithreading affects memory use (e.g., every thread introduces X% overhead), we should be able to get a sense of how much RAM the multithreaded version will need.

#11 Updated by Bryan Cosca over 5 years ago

Here are a few job examples: qr1hi-8i9sb-tjq8yphaeckclpc (whole exome, ~8 hours), qr1hi-8i9sb-71qnwnebru4wud1 (whole exome, ~6 hours), qr1hi-8i9sb-io9edg3lekma5w3 (chr 1, ~50 minutes), qr1hi-8i9sb-1xdtgwzze13kytk (chr Y, ~13 minutes).

The first one uses different parameters to call variants. I'm curious whether those different parameters make a difference as well.

#12 Updated by Brett Smith over 5 years ago

Bryan Cosca wrote:

Here are a few job examples: qr1hi-8i9sb-tjq8yphaeckclpc (whole exome, ~8 hours), qr1hi-8i9sb-71qnwnebru4wud1 (whole exome, ~6 hours), qr1hi-8i9sb-io9edg3lekma5w3 (chr 1, ~50 minutes), qr1hi-8i9sb-1xdtgwzze13kytk (chr Y, ~13 minutes).

The first one uses different parameters to call variants. I'm curious whether those different parameters make a difference as well.

I don't see any difference in the parameters of those jobs, except for the input collection. Are you referring to different parameters in the jobs that created the input?

The 8-hour job used just under a gigabyte of RAM. If we conservatively assume that (a) each thread will use about the same amount of RAM, and (b) that the specific input you're working with for this multithreaded job is going to require twice as much RAM as the 8-hour job, then it should be possible to run the job with -Xmx15g -nct 6. That should lead to RAM use of (6 threads * 2 GiB) == 12GiB, which easily fits under the specified 15g maximum. I think you should try that now, and if it fails, try again with -Xmx95g.
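
The estimate above can be written out as a small calculation. This is only a sketch of that reasoning under the same two assumptions, (a) and (b); the constant and function names are made up for illustration, and the 1 GiB single-thread figure rounds up "just under a gigabyte".

```python
# Sketch of the estimate above; all names are illustrative.
SINGLE_THREAD_GIB = 1   # rounded up from "just under a gigabyte"
INPUT_FACTOR = 2        # assumption (b): this input needs twice as much RAM
HEAP_LIMIT_GIB = 15     # from -Xmx15g


def estimated_heap_gib(threads,
                       single_thread_gib=SINGLE_THREAD_GIB,
                       input_factor=INPUT_FACTOR):
    """Assumption (a): every thread uses about the same amount of RAM."""
    return threads * single_thread_gib * input_factor


def fits(threads, heap_limit_gib=HEAP_LIMIT_GIB):
    """True if the estimated heap for this thread count is within the limit."""
    return estimated_heap_gib(threads) <= heap_limit_gib


estimate = estimated_heap_gib(6)
# Largest thread count whose estimate still fits under the heap limit.
max_threads_under_limit = HEAP_LIMIT_GIB // (SINGLE_THREAD_GIB * INPUT_FACTOR)
print(estimate, fits(6), max_threads_under_limit)
```

Under these assumptions -nct 6 leaves about 3 GiB of slack below the 15g ceiling, and 7 threads would be the most that still fits the estimate.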

#13 Updated by Bryan Cosca over 5 years ago

I finished the -nct 6, 15g run here: jobs/qr1hi-8i9sb-l7vv9qqxozezv38 and it finished in ~4 hours. I started running the 95g job but I didn't finish it because it was looking to take ~8 hours. So it seems that -nct is working great! As we were discussing before, it's weird that it is only using ~0.5 GB of RAM. That might still be worth investigating.

#14 Updated by Brett Smith over 5 years ago

Bryan Cosca wrote:

I finished the -nct 6, 15g run here: jobs/qr1hi-8i9sb-l7vv9qqxozezv38 and it finished in ~4 hours. I started running the 95g job but I didn't finish it because it was looking to take ~8 hours.

It sounds like what's happening is that Java allocates memory for its heap up front, rather than growing it on demand. Because of that, it starts accessing swap relatively early in the process, and that slows the whole thing down. I wish Java were more dynamic about this, but if this is how it behaves, it makes sense for performance to try to keep the maximum allocation inside real RAM rather than bleeding into swap.

If you want to try for even more performance, the next thing to do would be to increase the number of threads (the -nct argument). The compute nodes on qr1hi have 8 cores, so that's the highest number you should try.

So it seems that -nct is working great! As we were discussing before, it's weird that it is only using ~0.5 GB of RAM. That might still be worth investigating.

I created a separate bug for this, #4185. I think the specific issue with GATK is resolved. Let me know if you agree, and I'll close this ticket if so. Thanks.

#15 Updated by Bryan Cosca over 5 years ago

I agree. Thanks!

#16 Updated by Brett Smith over 5 years ago

  • Status changed from In Progress to Resolved
  • Remaining (hours) set to 0.0
