Bug #3373

[Sample pipelines] Improve SNV pipeline to accept example exome fastq data (2 pairs of reads) as a single input collection.

Added by Tom Clegg about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: Sample Pipelines
Target version:
Start date: 07/30/2014
Due date:
% Done: 100%
Estimated time: (Total: 4.00 h)
Story points: 2.0

Subtasks

Task #3488: Review 3373-improve-gatk3-snv-pipeline (Resolved, Peter Amstutz)

Task #3417: Split by read pair, perform parallel alignment, merge bam files (Resolved, Peter Amstutz)

Task #3548: Update pipeline to use arvados repository once merged (Resolved, Peter Amstutz)

Task #3547: Add Haplotype caller (Resolved, Peter Amstutz)

Task #3549: Use GATK 3.2 (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Story #3015: Make gatk3 pipeline template (Resolved, 06/24/2014)

Associated revisions

Revision c5be0c3a
Added by Peter Amstutz about 5 years ago

Merge branch '3373-improve-gatk3-snv-pipeline' closes #3373

Revision 20b2859d (diff)
Added by Peter Amstutz about 5 years ago

Fix syntax error in decompress-all.py refs #3373

Revision b00bc4b6 (diff)
Added by Peter Amstutz about 5 years ago

Fix regular expression in decompress-all.py refs #3373

History

#1 Updated by Peter Amstutz about 5 years ago

  • Assigned To set to Peter Amstutz

#2 Updated by Peter Amstutz about 5 years ago

  • Status changed from New to In Progress

#3 Updated by Peter Amstutz about 5 years ago

  • Status changed from In Progress to Resolved

#4 Updated by Tom Clegg about 5 years ago

In decompress-all.py
  • I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)
    import sys
    ...
    infile_parts = re.match(r"(^[a-f0-9]{32}\+\d+)(\+\S+)*(/.*)(/[^/]+)$", input_file)
    if infile_parts is None:
        print >>sys.stderr, "Failed to parse input filename '%s' as a Keep file" % input_file
        sys.exit(1)
    
    cr = arvados.CollectionReader(infile_parts.group(1))
    ...
    
  • I suspect the third regexp group should have a ? after it, so {hash}/foo.gz works.
  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic:
        if rc == 0:
            out = arvados.CollectionWriter()
            out.write_directory_tree(outdir, max_manifest_depth=0)
            task.set_output(out.finish())
        else:
            task.set_output(streamname + filereader.as_manifest()[1:])
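
Taken together, the fail-fast suggestion and the proposed "?" on the third regexp group might look like the sketch below (shown in Python 3 style; the locator used is the empty-collection hash, purely for illustration):

```python
import re
import sys

# Keep locator regex from the review, with the suggested "?" on the
# directory group so "{hash}/foo.gz" (no subdirectory) also matches.
KEEP_FILE_RE = re.compile(r"(^[a-f0-9]{32}\+\d+)(\+\S+)*(/.*)?(/[^/]+)$")

def parse_input_file(input_file):
    # Fail fast with an error message instead of indenting the rest of
    # the program inside an "if matched" block.
    infile_parts = KEEP_FILE_RE.match(input_file)
    if infile_parts is None:
        sys.stderr.write("Failed to parse input filename '%s' as a Keep file\n"
                         % input_file)
        sys.exit(1)
    return infile_parts
```

With the optional group, both "{hash}/foo.gz" and "{hash}/dir/foo.gz" parse, and group(1) is always the collection locator.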
    
In split_fastq
  • We appear to silently ignore any fastq.gz files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data.
  • If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1).
  • This sort of thing should go to stderr: print "Finish piece ./_%s/%s (%s %s)"
  • chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)
  • Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)
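
The pairing rule suggested above could be sketched roughly as follows (the helper name and file names are illustrative, and a lone _1 file is treated as unpaired here; the real script would queue tasks rather than return tuples):

```python
import re

def group_read_files(filenames):
    """Pair *_1/_2 fastq files, keep unpaired fastq files, and fail loudly
    when a _2 file has no corresponding _1 (per the review suggestion)."""
    files = set(filenames)
    processed = 0
    jobs = []
    for fn in sorted(files):
        if re.search(r"_1\.fastq(\.gz)?$", fn):
            mate = re.sub(r"_1(\.fastq(\.gz)?)$", r"_2\1", fn)
            if mate in files:
                jobs.append((fn, mate))
                processed += 2
            else:
                # No _2 mate: treat as unpaired rather than dropping it.
                jobs.append((fn,))
                processed += 1
        elif re.search(r"_2\.fastq(\.gz)?$", fn):
            pass  # counted when its _1 mate is handled
        elif re.search(r"\.fastq(\.gz)?$", fn):
            jobs.append((fn,))
            processed += 1
    # An orphan _2 was skipped but never counted, so the totals disagree.
    if processed != len(files):
        raise ValueError("some _2 files had no matching _1 mate")
    return jobs
```

The final count check is what catches a _2 with no corresponding _1 instead of silently dropping it.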
run-command
  • This isn't new in this branch, but it looks like an extraneous space: subst.default_subs["link "] = sub_link
  • Print diagnostics to stderr, not stdout
  • There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?
arv-run-pipeline-instance
  • I think .to_json is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:
                             :body_object => {
                               :pipeline_instance => attributes.to_json
                             },

#5 Updated by Tom Clegg about 5 years ago

  • Category set to Sample Pipelines
  • Status changed from Resolved to In Progress

#6 Updated by Peter Amstutz about 5 years ago

  • Target version changed from 2014-08-06 Sprint to 2014-08-27 Sprint

#7 Updated by Peter Amstutz about 5 years ago

  • Target version deleted (2014-08-27 Sprint)

Tom Clegg wrote:

In decompress-all.py
  • I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)

Fixed

  • I suspect the third regexp group should have a ? after it, so {hash}/foo.gz works.

Fixed

  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic
<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?" 
<brett> No.  It is not so easy to distinguish between those cases.
In split_fastq
  • We appear to silently ignore any fastq.gz files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data.

Fixed.

  • If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1).

Fixed.

  • This sort of thing should go to stderr: print "Finish piece ./_%s/%s (%s %s)"

Removed some debugging prints and changed the rest to go to stderr.

  • chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)

Left that alone until I revisit chunking with a more efficient algorithm and chunking is actually worth enabling.

  • Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)

Added comment.

run-command
  • This isn't new in this branch, but it looks like an extraneous space: subst.default_subs["link "] = sub_link

That's intentional. In the absence of more sophisticated parsing, the various special-case substitutions are detected with str.startswith(), so the space is there to avoid situations like matching the beginning of "$(link_name)" and calling sub_link with "_name" as the argument.
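
A toy illustration of why the trailing space matters (names are hypothetical, not run-command's actual dispatch table):

```python
# Keys include a trailing space so prefix matching only fires on a
# complete keyword followed by an argument.
default_subs = {"link ": "sub_link", "file ": "sub_file"}

def find_sub(expr):
    # Without the space, "link_name" would match the "link" handler
    # and be called with "_name" as its argument.
    for prefix, handler in default_subs.items():
        if expr.startswith(prefix):
            return handler, expr[len(prefix):]
    return None, expr
```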

  • Print diagnostics to stderr, not stdout

Uses Python logging module now, configured to go to stderr.
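
A minimal example of that kind of configuration, assuming a hypothetical logger name, so diagnostics land on stderr and stdout stays free for pipeline output:

```python
import logging
import sys

# Route all diagnostics through a stderr handler.
logger = logging.getLogger("run-command")
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s: %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Finish piece ./_%s/%s", 0, "sample_1.fastq")
```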

  • There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?

Fixed: the global 'p' is now 'taskp'.

arv-run-pipeline-instance
  • I think .to_json is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:

Fixed.

#8 Updated by Peter Amstutz about 5 years ago

  • Target version set to 2014-08-27 Sprint

#9 Updated by Tom Clegg about 5 years ago

Peter Amstutz wrote:

  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic

<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?"
<brett> No. It is not so easy to distinguish between those cases.

We should pattern-match the filename to decide whether to try decompressing it, and then demand the expected behavior. This is more predictable and safer:
  • If something is called *.gz but isn't actually compressed (or decompression fails for some other reason), you fail.
  • If something is called *.txt and is compressed, we pass it through unchanged. If you really expected this to give you transparent decompression, you fail.

It looks like dtrx doesn't have anything like --list-supported-extensions but we can make a pretty good guess that \.(gz|Z|bz2|tgz|tbz|zip|rar|7z|cab|deb|rpm|cpio|gem)$ should work.
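
A sketch of that decide-then-demand approach using the extension list above (helper names are hypothetical, and run_dtrx stands in for invoking dtrx and returning its exit code):

```python
import re

# Extension list proposed in the review; files matching it are expected
# to decompress, so a dtrx failure becomes a hard error rather than a
# silent pass-through.
COMPRESSED_RE = re.compile(r"\.(gz|Z|bz2|tgz|tbz|zip|rar|7z|cab|deb|rpm|cpio|gem)$")

def should_decompress(filename):
    return COMPRESSED_RE.search(filename) is not None

def handle(filename, run_dtrx):
    if should_decompress(filename):
        if run_dtrx(filename) != 0:
            raise RuntimeError("decompression of '%s' failed" % filename)
        return "decompressed"
    # Not a recognized compressed extension: pass through unchanged.
    return "passthrough"
```

This gives the predictable behavior described: a *.gz that won't decompress fails loudly, and a *.txt is passed through even if it happens to contain gzip data.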

I'm pushing on this because I've already seen silently ignored decompression failures result in many days of wasted compute time downstream, not to mention troubleshooting time. Thanks, pigz, for exiting 0 so much.

In this case, if we assume failure means "already uncompressed", we get an unpredictable script which sometimes outputs "foo.txt.gz" and sometimes outputs "foo.txt", depending on whether some transient system error occurred, whether dtrx was installed, etc. Surely we'd rather be predictable, even if it means sacrificing the ability to transparently decompress gzip data that is disguised as *.txt. :P

Everything else LGTM, thanks!

#10 Updated by Anonymous about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 40 to 100

Applied in changeset arvados|commit:c5be0c3abd926adc54e0c1de65e8dfdd25a84ea1.
