Story #8940

[Docs] Write "How to optimize a pipeline" Page

Added by Sarah Guthrie over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Documentation
Target version:
Start date: 04/12/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

Go through at least these points:

  • Running crunchstat_summary (how to pick runtime_constraints)
  • Picking the correct number of threads
  • How to optimize the number of tasks when you don't have native multithreading
  • Choosing between TaskOutputDir and tmpdir
  • Changing keep_cache_size
  • When to pipe and when to write to keep

Subtasks

Task #8967: Review (Resolved, Sarah Guthrie)

Task #8983: Review (Resolved, Tom Clegg)

History

#1 Updated by Sarah Guthrie over 3 years ago

  • Target version changed from Pipeline Future Sprints to 2016-04-27 sprint

#2 Updated by Bryan Cosca over 3 years ago

  • Assigned To set to Bryan Cosca

#3 Updated by Bryan Cosca over 3 years ago

  • Status changed from New to In Progress

#5 Updated by Sarah Guthrie over 3 years ago

Adding an overview would be very helpful. What, specifically, will the page cover, and why? It is currently hard to follow. I might split the page using more of a "how-to" approach:
  • How do I alter the design of the pipeline?
    • Splitting work into jobs
    • If one job runs multiple pieces of software - piping vs writing to a tmpdir
    • Writing the output of a job (TaskOutputDir vs tmpdir)
  • max_tasks_per_node
    • What is it and when does it affect me?
    • How do I choose it?
  • min_ram_mb_per_node
    • What is it and when does it affect me?
    • How do I choose it?
  • num_cores_per_node
    • What is it and when does it affect me?
    • How do I choose it?
  • keep_cache_mb_per_task
    • What is it and when does it affect me?
    • How do I choose it?
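To make the outline above concrete, here is a minimal sketch of how the four settings might sit together, written as a Python dict. The keys are the ones named in the outline; every value is a made-up placeholder, and the helper `tasks_that_fit` is a hypothetical heuristic for illustration, not part of any Arvados SDK.

```python
# Placeholder values only; real numbers should come from crunchstat-summary
# output for the job in question.
runtime_constraints = {
    "max_tasks_per_node": 4,
    "min_ram_mb_per_node": 16000,
    "num_cores_per_node": 8,
    "keep_cache_mb_per_task": 512,
}


def tasks_that_fit(node_ram_mb, node_cores, task_ram_mb, task_cores=1):
    """Estimate how many tasks fit on one node.

    The limit is whichever resource (RAM or cores) runs out first.
    This is an illustrative heuristic, not an official formula.
    """
    by_ram = node_ram_mb // task_ram_mb
    by_cores = node_cores // task_cores
    return max(1, min(by_ram, by_cores))
```

For example, with 16000 MB of RAM, 8 cores, and single-threaded tasks that each need about 3500 MB, RAM is the limiting resource, so only 4 tasks fit even though there are 8 cores.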

It could be helpful to show the commands that generate the crunchstat-summary outputs you show as examples. An example of a pathological job could also be really helpful.

#6 Updated by Bryan Cosca over 3 years ago

I've added all your comments and updated the wiki.

#7 Updated by Sarah Guthrie over 3 years ago

Awesome! This is much better.

Again, feel free to link to https://dev.arvados.org/projects/arvados/wiki/Writing_a_Script_Calling_a_Third_Party_Tool if you think it'll be helpful.

A few more comments:

Choosing the right number of jobs

This section is very confusing - which question are you answering?

Do I want to do alignment and variant calling in one step? Should I separate them? (usually, no!)

You've implicitly described the disadvantages of writing many things to keep, but it would help to state them specifically:
  • Intermediate files can take up a lot of unnecessary space
  • Writing to keep requires that the next job reads from keep, which generates I/O wait time that can be skipped by piping or reduced by storing locally in the temporary directory

One more advantage that can be mentioned here is that jobs that don't depend on each other will be run concurrently if the nodes are available. This results in a speed-up. You can say that the dependencies are determined automatically from the pipeline template component definitions.

Writing to keep

Mentioning that TaskOutputDir operates on a FUSE mount is probably a good idea, since some people will know the constraints FUSE mounts operate under.

How to use crunchstat-summary

I think you might be missing the code format here:

$ ~/arvados/tools/crunchstat-summary/bin/crunchstat-summary --format html --job qr1hi-8i9sb-bzn6hzttfu9cetv > qr1hi-8i9sb-bzn6hzttfu9cetv.html

#8 Updated by Bryan Cosca over 3 years ago

> Do I want to do alignment and variant calling in one step? Should I separate them? (usually, no!)

... I totally meant yes (no to the first question and yes to the second...) words.

Everything else I added.

#9 Updated by Sarah Guthrie over 3 years ago

I'm ready for Brett to review this

#10 Updated by Tom Clegg over 3 years ago

Suggest keeping pronouns consistent. In the "questions you may be asking" part, the reader becomes "I" for a few lines and then goes back to being "you" afterwards. E.g., perhaps change to "Relevant questions include: Do you want ...?"

In the "try piping" paragraph:
  • "has shown to sometimes be faster" → "is generally faster"
  • AFAIK the only cases where this isn't true are those where running both tools at once causes expensive resource contention. For example, the arv-mount cache may be too small to support more concurrent readers (reading from the network is far slower than reading from the memory cache), or there may not be enough memory for both tools to run at once (programs crash or resort to staging temp data to disk instead of keeping it in memory). If you come up with a reasonable way to explain this, I think it would be valuable.
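For anyone unfamiliar with the piping pattern under discussion, a minimal Python sketch follows: two child processes are connected by a pipe, so the intermediate data never touches disk or keep. The two commands are stand-ins (small Python one-liners), not real tools from the pipeline, and `run_piped` is a hypothetical helper name.

```python
import subprocess
import sys


def run_piped(producer_cmd, consumer_cmd):
    """Run `producer | consumer` and return the consumer's stdout as text."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE, text=True
    )
    # Close the parent's copy of the pipe so the consumer sees EOF when the
    # producer finishes, and the producer is signalled if the consumer exits.
    producer.stdout.close()
    output, _ = consumer.communicate()
    if producer.wait() != 0 or consumer.returncode != 0:
        raise RuntimeError("a stage in the pipe failed")
    return output


# Stand-in commands: the first emits three lines, the second counts them.
emit = [sys.executable, "-c", "print('a'); print('b'); print('c')"]
count = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]
```

Both processes are alive at the same time, which is exactly why the node needs enough memory and arv-mount cache for the two of them together.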

"If you want a lot of checkpoints..." paragraph: maybe split "different node types" issue to its own paragraph. And the "node types" issue seems a bit murky. Are we saying bwa-mem should be in the same job as fastqc, or a separate job? My sense is the right answer is "it depends"... But "single-threaded | single-threaded | single-threaded" is often a good way to get efficient resource usage, especially if they don't all use lots of RAM: essentially the low-RAM processes run free. This is similar/related to the "how many tasks" discussion (not sure how to make use of that connection).

"using a tmpdir is always the safe alternative" → ...only if you request/get enough scratch space on your node. (The other advantage of using the fuse mount as scratch space is that space is ~unlimited.)
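One way to sanity-check the scratch-space caveat is to compare the free space in the temporary directory with what the job expects to write. The sketch below uses only the Python standard library; `has_scratch_space` and the 20% headroom are illustrative choices, not anything Arvados provides.

```python
import shutil
import tempfile


def has_scratch_space(needed_bytes, path=None, headroom=1.2):
    """Return True if `path` has at least needed_bytes * headroom free.

    `path` defaults to the system temporary directory.
    """
    path = path or tempfile.gettempdir()
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * headroom
```

A job could call this before staging large inputs locally and fall back to the fuse mount when it returns False.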

"job that's very i/o intensive" -- could get more specific about this. AFAIK increasing the cache doesn't help an IO-bound process that reads big files one at a time without seeking. "Multiple concurrent reader processes" and "random access (seeking) on the input file" are the cases that really benefit from a bigger arv-mount cache, right?

"These recommendations are for you to set to ensure the job will be able to call the right node type and run reliably when reproduced" → maybe "These suggest ways to introduce or reduce runtime constraints in order to use cheaper nodes when running similar jobs, without making them slow down or run out of memory." ...?

#11 Updated by Bryan Cosca over 3 years ago

Thanks! I've added all your comments in.

> "job that's very i/o intensive" -- could get more specific about this. AFAIK increasing the cache doesn't help an IO-bound process that reads big files one at a time without seeking. "Multiple concurrent reader processes" and "random access (seeking) on the input file" are the cases that really benefit from a bigger arv-mount cache, right?

Yes, that makes sense. I changed the line to job that's very cpu intensive. Because if we're computing faster we may need more data faster... that makes sense right?

If a job is i/o intensive, I've suggested that they copy the file to temp space. Do you have more options? I remember this from a BAM merge job, but I don't remember the specific merge algorithm that made it slow. If you know of one, I can put it in.

#12 Updated by Tom Clegg over 3 years ago

"For instance, not having enough memory to support both processes or arv-mount cache is too small to support reading from." → should this say "...to support reading from multiple processes reading from different files at the same time"?

> I changed the line to job that's very cpu intensive. Because if we're computing faster we may need more data faster... that makes sense right?

Increasing the arv-mount cache only helps when there's non-sequential access, which typically means either one process doing a lot of seeking or (more likely) one or more processes reading multiple files at once. I wouldn't say it's closely tied to CPU-intensive or CPU-bound processes. For example, bwa reads sequentially, and is typically CPU-bound even though it can use lots of cores. Increasing its arv-mount cache won't make it run any faster.

I think the most useful indicator that you need a bigger arv-mount cache is the "cache utilization" figure. crunchstat-summary suggests increasing the cache when utilization is below 80% (i.e., for every 100 bytes arv-mount reads from the network, your program is only getting <80 bytes of input) because this usually means chunks are being read from the network but then ejected from the cache before your program gets a chance to read them.
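The arithmetic behind that figure can be sketched as follows. This mirrors the description above (bytes delivered to the program per bytes fetched from the network, with an 80% threshold); it is not copied from crunchstat-summary itself, and both function names are hypothetical.

```python
def cache_utilization(bytes_read_by_program, bytes_fetched_from_network):
    """Percentage of network-fetched bytes that actually reached the program."""
    if bytes_fetched_from_network == 0:
        return 100.0  # nothing was fetched, so nothing was wasted
    return 100.0 * bytes_read_by_program / bytes_fetched_from_network


def should_increase_cache(utilization_pct, threshold_pct=80.0):
    """True when utilization falls below the threshold described above."""
    return utilization_pct < threshold_pct
```

So a job whose program received 75 bytes for every 100 bytes arv-mount fetched sits at 75% utilization and would be flagged, while one at 95% would not.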

> If a job is i/o intensive, I've suggested that they copy the file to temp space.

If you mean "seeks a lot on its input" (more than you can handle even by increasing arv-mount cache) or "writes a lot of data and then reads it back" then yes, temp space is a good option.

#13 Updated by Bryan Cosca over 3 years ago

I've added all your comments.

#14 Updated by Bryan Cosca over 3 years ago

  • Status changed from In Progress to Resolved
