[Docs] Pipeline author guide gives a basic demonstration of including a third-party tool
Write a new wiki page describing:
- Basic introduction to writing a Dockerfile (with links to more resources/references), using a small but real analysis tool
- How to build your Docker image
- How to upload your Docker image to Arvados
- How to call your tool from a Crunch script, including best practices (using subprocess.Popen, capturing stdout, uploading results, setting success based on Popen's returncode)
- How to upload output from the tool using arvados.crunch.TaskOutputDir()
- Explain when TaskOutputDir does not work:
- The tool writes things that FUSE does not support (symbolic links and named pipes)
- The I/O access patterns are not performant with FUSE (for example, 20 file handles open on one file, as with tophat)
- For when it doesn't work, explain how to use a tempdir and how to save one file from that directory or the entire directory tree to Keep
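
A minimal sketch of the calling pattern listed above, using only the Python standard library: run the tool with subprocess.Popen inside a temporary directory, capture its stdout, and decide success from the return code. The names `run_tool`, `list_outputs`, and `demo_cmd` are hypothetical, and the "tool" is a one-line stand-in rather than a real analysis program. The Arvados-specific steps (saving files to Keep, recording task success) are deliberately left as a comment, since they depend on the SDK version in use.

```python
import os
import subprocess
import sys
import tempfile


def run_tool(cmd, workdir):
    """Run cmd inside workdir; return (succeeded, captured stdout text)."""
    proc = subprocess.Popen(
        cmd,
        cwd=workdir,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout as text instead of bytes
    )
    stdout_text, _ = proc.communicate()  # blocks until the tool exits
    return proc.returncode == 0, stdout_text


def list_outputs(workdir):
    """Return the sorted relative paths of every file left in workdir."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(workdir):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), workdir))
    return sorted(found)


# Hypothetical stand-in for a real tool: writes one file and prints one line.
demo_cmd = [
    sys.executable, "-c",
    "open('result.txt', 'w').write('42\\n'); print('done')",
]

workdir = tempfile.mkdtemp()
succeeded, stdout_text = run_tool(demo_cmd, workdir)
outputs = list_outputs(workdir)
# In a real Crunch script, this is the point to save one file or the whole
# directory tree to Keep, and to set task success based on `succeeded`.
```

Working in a tempdir keeps the tool's writes on local disk, which sidesteps the FUSE limitations noted above; only the finished files need to be copied out afterwards.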
#4 Updated by Sarah Guthrie over 3 years ago
- Status changed from New to In Progress
Brett, this is currently very unpolished, but I'd welcome comments
#6 Updated by Brett Smith over 3 years ago
Everything on the page is great. The examples are to-the-point and clear. The order of concepts on the page makes sense.
I think it could be even more useful if the page explained not just how to do things, but what things we're doing, and why. Imagine an excited bioinformatician who's just getting started with Arvados. They understand how to run the tool that they're porting, and we can (or should) assume that they have basic familiarity with Python, but we should assume they don't know about any technology beyond that. When we introduce new technology on the page—Docker, Keep, the Arvados SDK—we should explain its purpose, to give them the grounding to understand how the pieces fit together.
These introductions don't have to be comprehensive. Like you did for Dockerfiles, linking to other resources for more information is great. But a sentence or two of introduction will help ground the reader, and help tie together the disparate pieces.
With this kind of reader profile, here are the questions I imagine them asking as they go down the current page:
- What's Docker?
- What's a Docker image?
- What's arvados/jobs?
- What's the crunch user?
- What's the arvados module import?
- What's getjobparam? (What's a job?)
- How does the script find the input file? (It might be nice to add a demonstration of get_task_param_mount; that might help illustrate the concept.)
- What's the current_task? (What's a task?)
- What is its tmpdir attribute?
- What does TaskOutputDir() do?
- What does
- What does the manifest_text method do? (What's a manifest? I think we can assume that a user who's gotten this far knows roughly that a collection is a set of files in Arvados, and knows the data is stored in Keep. They probably know what FUSE is too, since our tutorial covers that, but again, it would help to explain that each running task has FUSE set up for it.)
- What I/O access patterns aren't performant with FUSE? (What's an "I/O access pattern?")
- What does CollectionWriter() do? What do its methods do (write_file, write_directory_tree, finish)? (P.S., The example code in this section probably shouldn't have
- What's a Crunch script?
- What's all the extra stuff about multiprocessing and CPU counts in the final script? (What's a thread?) (This is a worthy subject, but we might want to save this stuff for a separate page about optimization. Figuring out how to get the best performance from a tool depends a lot on the tool itself, and I'm worried covering that here would be too much detail for something intended to be an introduction.)
Whew! That's a lot. I kind of got in the mindset and got on a roll. But I hope it's not overwhelming, because I don't think it requires majorly overhauling the page. We already have documentation that explains a lot of this: there are user guide pages describing the basics of Docker and Arvados; of Crunch scripts; of pipeline templates, etc. We should link to those liberally. But I think addressing them, even with just one sentence, can make for a good checkpoint to be sure the reader is actually following along with the text. For example, "Here's a pipeline template that demonstrates how to use your new script. If you're not sure what a pipeline template is, or the format definition, check this User Guide page." Does that make sense?