Everything on the page is great. The examples are to-the-point and clear. The order of concepts on the page makes sense.
I think it could be even more useful if the page explained not just how to do things, but what things we're doing, and why. Imagine an excited bioinformatician who's just getting started with Arvados. They understand how to run the tool that they're porting, and we can (or should) assume that they have basic familiarity with Python, but we should assume they don't know about any technology beyond that. When we introduce new technology on the page—Docker, Keep, the Arvados SDK—we should explain its purpose, to give them the grounding to understand how the pieces fit together.
These introductions don't have to be comprehensive. Like you did for Dockerfiles, linking to other resources for more information is great. But a sentence or two of introduction will help ground the reader, and help tie together the disparate pieces.
With this kind of reader profile, here are the questions I imagine them asking as they go down the current page:
- What's Docker?
- What's a Docker image?
- What's arvados/jobs?
- What's the crunch user?
- What's the arvados module import?
- What's getjobparam? (What's a job?)
- How does the script find the input file? (It might be nice to add a demonstration of get_task_param_mount; that might help illustrate the concept.)
- What's the current_task? (What's a task?)
- What is its tmpdir attribute?
- What does TaskOutputDir() do?
- What does task.set_output do?
- What does the manifest_text method do? (What's a manifest? I think we can assume that a user who's gotten this far knows roughly that a collection is a set of files in Arvados, and knows the data is stored in Keep. They probably know what FUSE is too, since our tutorial covers that, but again, it would help to explain that each running task has FUSE set up for it.)
- What I/O access patterns aren't performant with FUSE? (What's an "I/O access pattern?")
- What does CollectionWriter() do? What do its methods do (write_file, write_directory_tree, finish)? (P.S. The example code in this section probably shouldn't have write_file('foo.txt').)
- What's a Crunch script?
- What's all the extra stuff about multiprocessing and CPU counts in the final script? (What's a thread?) (This is a worthy subject, but we might want to save this stuff for a separate page about optimization. Figuring out how to get the best performance from a tool depends a lot on the tool itself, and I'm worried covering that here would be too much detail for something intended to be an introduction.)
Whew! That's a lot. I kind of got in the mindset and got on a roll. But I hope it's not overwhelming, because I don't think it requires a major overhaul of the page. We already have documentation that explains a lot of this: there are user guide pages describing the basics of Docker and Arvados, Crunch scripts, pipeline templates, etc. We should link to those liberally. But I think addressing each of these questions, even with just one sentence, can make for a good checkpoint to be sure the reader is actually following along with the text. For example, "Here's a pipeline template that demonstrates how to use your new script. If you're not sure what a pipeline template is, or how its format is defined, check this User Guide page." Does that make sense?
Thanks.