Project

General

Profile

Actions

Bug #3373

closed

[Sample pipelines] Improve SNV pipeline to accept example exome fastq data (2 pairs of reads) as a single input collection.

Added by Tom Clegg over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Sample Pipelines
Target version:
Story points:
2.0

Subtasks 5 (0 open5 closed)

Task #3488: Review 3373-improve-gatk3-snv-pipelineResolvedPeter Amstutz07/30/2014Actions
Task #3417: Split by read pair, perform parallel alignment, merge bam filesResolvedPeter Amstutz07/30/2014Actions
Task #3548: Update pipeline to use arvados repository once mergedResolvedPeter Amstutz07/30/2014Actions
Task #3547: Add Haplotype callerResolvedPeter Amstutz07/30/2014Actions
Task #3549: Use GATK 3.2ResolvedPeter Amstutz07/30/2014Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Idea #3015: Make gatk3 pipeline templateResolvedPeter Amstutz06/24/2014Actions
Actions #1

Updated by Peter Amstutz over 10 years ago

  • Assigned To set to Peter Amstutz
Actions #2

Updated by Peter Amstutz over 10 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Peter Amstutz over 10 years ago

  • Status changed from In Progress to Resolved
Actions #4

Updated by Tom Clegg over 10 years ago

In decompress-all.py
  • I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)
    import sys
    ...
    infile_parts = re.match(r"(^[a-f0-9]{32}\+\d+)(\+\S+)*(/.*)(/[^/]+)$", input_file)
    if infile_parts == None:
        print >>sys.stderr, "Failed to parse input filename '%s' as a Keep file\n" % input_file
        sys.exit(1)
    
    cr = arvados.CollectionReader(infile_parts.group(1))
    ...
    
  • I suspect the third regexp group should have a ? after it, so {hash}/foo.gz works.
  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic:
        if rc == 0:
            out = arvados.CollectionWriter()
            out.write_directory_tree(outdir, max_manifest_depth=0)
            task.set_output(out.finish())
        else:
            task.set_output(streamname + filereader.as_manifest()[1:])
    
In split_fastq
  • We appear to silently ignore any fastq.gz files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data.
  • If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1).
  • This sort of thing should go to stderr: print "Finish piece ./_%s/%s (%s %s)"
  • chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)
  • Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)
run-command
  • This isn't new in this branch, but it looks like an extraneous space: subst.default_subs["link "] = sub_link
  • Print diagnostics to stderr, not stdout
  • There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?
arv-run-pipeline-instance
  • I think .to_json is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:
                             :body_object => {
                               :pipeline_instance => attributes.to_json
                             },
Actions #5

Updated by Tom Clegg over 10 years ago

  • Category set to Sample Pipelines
  • Status changed from Resolved to In Progress
Actions #6

Updated by Peter Amstutz over 10 years ago

  • Target version changed from 2014-08-06 Sprint to 2014-08-27 Sprint
Actions #7

Updated by Peter Amstutz over 10 years ago

  • Target version deleted (2014-08-27 Sprint)

Tom Clegg wrote:

In decompress-all.py
  • I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)

Fixed

  • I suspect the third regexp group should have a ? after it, so {hash}/foo.gz works.

Fixed

  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic
<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?" 
<brett> No.  It is not so easy to distinguish between those cases.
In split_fastq
  • We appear to silently ignore any fastq.gz files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data.

Fixed.

  • If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1).

Fixed.

  • This sort of thing should go to stderr: print "Finish piece ./_%s/%s (%s %s)"

Removed some debugging prints and changed the rest to go to stderr.

  • chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)

Left that alone until I revisit chunking with a more efficient algorithm and chunking is worth doing.

  • Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)

Added comment.

run-command
  • This isn't new in this branch, but it looks like an extraneous space: subst.default_subs["link "] = sub_link

That's intentional. In the absence of more sophisticated parsing, the various special case substitutions are detected with str.startswith() so the space is there in avoid situations like matching the beginning of "$(link_name)" and calling it with "_name" as the argument.

  • Print diagnostics to stderr, not stdout

Uses Python logging module now, configured to go to stderr.

  • There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?

Fixed global 'p' is now 'taskp'

arv-run-pipeline-instance
  • I think .to_json is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:

Fixed.

Actions #8

Updated by Peter Amstutz over 10 years ago

  • Target version set to 2014-08-27 Sprint
Actions #9

Updated by Tom Clegg over 10 years ago

Peter Amstutz wrote:

  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic

<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?"
<brett> No. It is not so easy to distinguish between those cases.

We should pattern-match the filename to decide whether to try decompressing it, and then demand the expected behavior. This is more predictable and safer:
  • If something is called *.gz but isn't actually compressed (or decompression fails for some other reason), you fail.
  • If something is called *.txt and is compressed, we pass it through unchanged. If you really expected this to give you transparent decompression, you fail.

It looks like dtrx doesn't have anything like --list-supported-extensions but we can make a pretty good guess that \.(gz|Z|bz2|tgz|tbz|zip|rar|7z|cab|deb|rpm|cpio|gem)$ should work.

I'm pushing on this because I've already seen silently ignored decompression failures result in many days of wasted compute time downstream, not to mention troubleshooting time. Thanks, pigz, for exiting 0 so much.

In this case, if we assume failure means "already uncompressed", we get an unpredictable script which sometimes outputs "foo.txt.gz" and sometimes outputs "foo.txt", depending on whether some transient system error occurred, whether dtrx was installed, etc. Surely we'd rather be predictable, even if it means sacrificing the ability to transparently decompress gzip data that is disguised as *.txt. :P

Everything else LGTM, thanks!

Actions #10

Updated by Anonymous over 10 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 40 to 100

Applied in changeset arvados|commit:c5be0c3abd926adc54e0c1de65e8dfdd25a84ea1.

Actions

Also available in: Atom PDF