Bug #3373
closed[Sample pipelines] Improve SNV pipeline to accept example exome fastq data (2 pairs of reads) as a single input collection.
Updated by Peter Amstutz over 10 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 10 years ago
- Status changed from In Progress to Resolved
Updated by Tom Clegg over 10 years ago
- I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)
import sys ... infile_parts = re.match(r"(^[a-f0-9]{32}\+\d+)(\+\S+)*(/.*)(/[^/]+)$", input_file) if infile_parts == None: print >>sys.stderr, "Failed to parse input filename '%s' as a Keep file\n" % input_file sys.exit(1) cr = arvados.CollectionReader(infile_parts.group(1)) ...
- I suspect the third regexp group should have a
?
after it, so {hash}/foo.gz works. - This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic:
if rc == 0: out = arvados.CollectionWriter() out.write_directory_tree(outdir, max_manifest_depth=0) task.set_output(out.finish()) else: task.set_output(streamname + filereader.as_manifest()[1:])
- We appear to silently ignore any
fastq.gz
files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data. - If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1).
- This sort of thing should go to stderr:
print "Finish piece ./_%s/%s (%s %s)"
- chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)
- Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)
- This isn't new in this branch, but it looks like an extraneous space:
subst.default_subs["link "] = sub_link
- Print diagnostics to stderr, not stdout
- There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?
- I think
.to_json
is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:
:body_object => {
:pipeline_instance => attributes.to_json
},
Updated by Tom Clegg over 10 years ago
- Category set to Sample Pipelines
- Status changed from Resolved to In Progress
Updated by Peter Amstutz over 10 years ago
- Target version changed from 2014-08-06 Sprint to 2014-08-27 Sprint
Updated by Peter Amstutz over 10 years ago
- Target version deleted (
2014-08-27 Sprint)
Tom Clegg wrote:
In decompress-all.py
- I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)
Fixed
- I suspect the third regexp group should have a
?
after it, so {hash}/foo.gz works.
Fixed
- This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic
<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?" <brett> No. It is not so easy to distinguish between those cases.
In split_fastq
- We appear to silently ignore any
fastq.gz
files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data.
Fixed.
- If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1).
Fixed.
- This sort of thing should go to stderr:
print "Finish piece ./_%s/%s (%s %s)"
Removed some debugging prints and changed the rest to go to stderr.
- chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)
Left that alone until I revisit chunking with a more efficient algorithm and chunking is worth doing.
- Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)
Added comment.
run-command
- This isn't new in this branch, but it looks like an extraneous space:
subst.default_subs["link "] = sub_link
That's intentional. In the absence of more sophisticated parsing, the various special case substitutions are detected with str.startswith() so the space is there in avoid situations like matching the beginning of "$(link_name)" and calling it with "_name" as the argument.
- Print diagnostics to stderr, not stdout
Uses Python logging module now, configured to go to stderr.
- There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?
Fixed global 'p' is now 'taskp'
arv-run-pipeline-instance
- I think
.to_json
is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:
Fixed.
Updated by Peter Amstutz over 10 years ago
- Target version set to 2014-08-27 Sprint
Updated by Tom Clegg over 10 years ago
Peter Amstutz wrote:
We should pattern-match the filename to decide whether to try decompressing it, and then demand the expected behavior. This is more predictable and safer:
- This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic
<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?"
<brett> No. It is not so easy to distinguish between those cases.
- If something is called
*.gz
but isn't actually compressed (or decompression fails for some other reason), you fail. - If something is called
*.txt
and is compressed, we pass it through unchanged. If you really expected this to give you transparent decompression, you fail.
It looks like dtrx doesn't have anything like --list-supported-extensions
but we can make a pretty good guess that \.(gz|Z|bz2|tgz|tbz|zip|rar|7z|cab|deb|rpm|cpio|gem)$
should work.
I'm pushing on this because I've already seen silently ignored decompression failures result in many days of wasted compute time downstream, not to mention troubleshooting time. Thanks, pigz, for exiting 0 so much.
In this case, if we assume failure means "already uncompressed", we get an unpredictable script which sometimes outputs "foo.txt.gz" and sometimes outputs "foo.txt", depending on whether some transient system error occurred, whether dtrx was installed, etc. Surely we'd rather be predictable, even if it means sacrificing the ability to transparently decompress gzip data that is disguised as *.txt
. :P
Everything else LGTM, thanks!
Updated by Anonymous over 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 40 to 100
Applied in changeset arvados|commit:c5be0c3abd926adc54e0c1de65e8dfdd25a84ea1.