Story #8563

[Docs] Pipeline author guide gives a basic demonstration of including a third-party tool

Added by Brett Smith almost 4 years ago. Updated almost 4 years ago.

In Progress
Assigned To:
Sarah Guthrie
Target version:
Support - Pipeline Future Sprints
Start date:
Due date:
% Done:


Estimated time:
(Total: 0.00 h)
Story points:


Write a new wiki page describing:

  • Basic introduction to writing a Dockerfile (with links to more resources/references), using a small but real analysis tool
  • How to build your Docker image
  • How to upload your Docker image to Arvados
  • How to call your tool from a Crunch script, including best practices (using subprocess.Popen, capturing stdout, uploading results, setting success based on Popen's returncode)
    • How to upload output from the tool using arvados.crunch.TaskOutputDir()
    • Explain when TaskOutputDir does not work:
      • The tool writes things that FUSE does not support (symbolic links and named pipes)
      • The I/O access patterns are not performant with FUSE (e.g., 20 file handles open on one file, as tophat does)
    • For the cases where it doesn't work, explain how to use a tempdir and how to save either one file from that directory or the entire directory tree to Keep
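
As a rough illustration of the Popen practices listed above, here is a minimal sketch that uses only the Python standard library. It runs a tool inside a tempdir, captures stdout, decides success from Popen's returncode, and lists the files that would then be saved. The names run_tool, list_outputs, run_and_collect and tool_stdout.txt are hypothetical, chosen for this example only. The upload to Keep is deliberately left out, because it needs the Arvados SDK and a running job; the wiki page would cover that step.

```python
import os
import subprocess
import tempfile


def run_tool(argv, workdir):
    """Run argv with workdir as its current directory.

    Returns (returncode, stdout_text). stderr is not redirected, so the
    tool's diagnostics still go wherever the parent's stderr goes.
    """
    proc = subprocess.Popen(argv, cwd=workdir, stdout=subprocess.PIPE)
    # communicate() reads all of stdout and waits for the tool to exit,
    # which avoids the deadlock that wait() can cause on a full pipe.
    stdout_bytes, _ = proc.communicate()
    return proc.returncode, stdout_bytes.decode("utf-8", errors="replace")


def list_outputs(workdir):
    """Return sorted paths, relative to workdir, of every file found there."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(workdir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, workdir))
    return sorted(found)


def run_and_collect(argv):
    """Run the tool in a fresh tempdir and report what it produced.

    Returns (success, files, workdir). A real Crunch script would upload
    the files and record success here; that part is an assumption and is
    not shown.
    """
    workdir = tempfile.mkdtemp(prefix="tool-output-")
    returncode, stdout_text = run_tool(argv, workdir)
    # Keep the captured stdout next to the tool's other output files.
    with open(os.path.join(workdir, "tool_stdout.txt"), "w") as out:
        out.write(stdout_text)
    success = (returncode == 0)  # success follows Popen's returncode
    return success, list_outputs(workdir), workdir
```

Calling run_and_collect(["mytool", "input.txt"]) (with a hypothetical mytool on the PATH) would return whether the tool exited with status 0, the relative paths of the files it left in the tempdir, and the tempdir itself, so the caller can choose to save one file or the whole tree.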


Task #8974: Review (Status: New, Assigned To: Tom Morris)


#1 Updated by Brett Smith almost 4 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Brett Smith almost 4 years ago

  • Target version changed from Arvados Future Sprints to Pipeline Future Sprints

#3 Updated by Sarah Guthrie almost 4 years ago

  • Description updated (diff)

#4 Updated by Sarah Guthrie almost 4 years ago

  • Status changed from New to In Progress

Brett, this is currently very unpolished, but I'd welcome comments.

#5 Updated by Sarah Guthrie almost 4 years ago

More polished. It would be nice to have someone review this.

#6 Updated by Brett Smith almost 4 years ago

Everything on the page is great. The examples are to-the-point and clear. The order of concepts on the page makes sense.

I think it could be even more useful if the page explained not just how to do things, but what things we're doing, and why. Imagine an excited bioinformatician who's just getting started with Arvados. They understand how to run the tool that they're porting, and we can (or should) assume that they have basic familiarity with Python, but we should assume they don't know about any technology beyond that. When we introduce new technology on the page—Docker, Keep, the Arvados SDK—we should explain its purpose, to give them the grounding to understand how the pieces fit together.

These introductions don't have to be comprehensive. Like you did for Dockerfiles, linking to other resources for more information is great. But a sentence or two of introduction will help ground the reader, and help tie together the disparate pieces.

With this kind of reader profile, here are the questions I imagine them asking as they go down the current page:

  • What's Docker?
  • What's a Docker image?
  • What's arvados/jobs?
  • What's the crunch user?
  • What's the arvados module import?
  • What's getjobparam? (What's a job?)
  • How does the script find the input file? (It might be nice to add a demonstration of get_task_param_mount—that might help illustrate the concept.)
  • What's the current_task? (What's a task?)
  • What is its tmpdir attribute?
  • What does TaskOutputDir() do?
  • What does task.set_output do?
  • What does the manifest_text method do? (What's a manifest? I think we can assume that a user who's gotten this far knows roughly that a collection is a set of files in Arvados, and knows the data is stored in Keep. They probably know what FUSE is too, since our tutorial covers that, but again, it would help to explain that each running task has FUSE set up for it.)
  • What I/O access patterns aren't performant with FUSE? (What's an "I/O access pattern?")
  • What does CollectionWriter() do? What do its methods do (write_file, write_directory_tree, finish)? (P.S., The example code in this section probably shouldn't have write_file('foo.txt').)
  • What's a Crunch script?
  • What's all the extra stuff about multiprocessing and CPU counts in the final script? (What's a thread?) (This is a worthy subject, but we might want to save this stuff for a separate page about optimization. Figuring out how to get the best performance from a tool depends a lot on the tool itself, and I'm worried covering that here would be too much detail for something intended to be an introduction.)

Whew! That's a lot. I kind of got in the mindset and got on a roll. But I hope it's not overwhelming, because I don't think it requires majorly overhauling the page. We already have documentation that explains a lot of this: there are user guide pages describing the basics of Docker and Arvados; of Crunch scripts; of pipeline templates, etc. We should link to those liberally. But I think addressing these questions on the page, even with just one sentence each, can make for a good checkpoint to be sure the reader is actually following along with the text. For example, "Here's a pipeline template that demonstrates how to use your new script. If you're not sure what a pipeline template is, or how its format is defined, check this User Guide page." Does that make sense?


#7 Updated by Sarah Guthrie almost 4 years ago

  • Assigned To set to Sarah Guthrie
  • Target version changed from Pipeline Future Sprints to 2016-04-27 sprint
  • Story points set to 0.5

Adding story points and moving it to the next sprint.

The proposed edits sound good.

#8 Updated by Sarah Guthrie almost 4 years ago

  • Target version changed from 2016-04-27 sprint to Pipeline Future Sprints
