Task #8636 (closed)
bcbio CWL variant call run problem: workflow cannot make any more progress
Description
Peter;
I'm running into an issue running the more complex bcbio workflow with variant calling. It completes several steps (alignment plus merging) then errors out with:
2016-03-03 14:58:15 arvados.cwl-runner[16205] ERROR: Workflow cannot make any more progress.
Here is the CWL being run. It does run to completion with the latest cwltool using the bcbio Docker image, though that doesn't rule out a mistake on my end:
https://s3.amazonaws.com/bcbio/cwl/test_bcbio_cwl.tar.gz
Here's the run on qr1hi (no obvious errors in the logs there):
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-7myh0kidb1kdsiv
The full log from the command line is below, although there's not much else there.
Thanks for any tips/ideas about what is happening.
```
$ bash run_arvados.sh /home/chapmanb/install/bio/arvados/sdk/cwl/miniconda/bin/cwl-runner 1.0.20160302134341
2016-03-03 13:19:01 arvados.arv-run[16205] INFO: Upload local files: "7_100326_FC6107FAAXX.bam.bai" "7_100326_FC6107FAAXX.bam"
2016-03-03 13:19:03 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-oechoi9ypexn4yl
2016-03-03 13:19:10 arvados.cwl-runner[16205] INFO: Job prep_align_inputs (qr1hi-8i9sb-pmq750nzfvv8u6q) is Queued
2016-03-03 13:19:10 arvados.arv-run[16205] INFO: Upload local files: "genomes/hg19/seq/hg19.fa" "genomes/hg19/seq/hg19.dict" "automated/variant_regions-bam.bed" "genomes/hg19/seq/hg19.fa.fai"
2016-03-03 13:19:11 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-065c7y5qw1vojxw
2016-03-03 13:19:17 arvados.cwl-runner[16205] INFO: Job prep_samples (qr1hi-8i9sb-ybwwhtno6skalm6) is Queued
2016-03-03 13:19:24 arvados.cwl-runner[16205] INFO: Job prep_samples_2 (qr1hi-8i9sb-ok0a411fz8a6jb6) is Queued
2016-03-03 13:19:35 arvados.cwl-runner[16205] INFO: Job prep_align_inputs (qr1hi-8i9sb-pmq750nzfvv8u6q) is Running
2016-03-03 13:19:35 arvados.cwl-runner[16205] INFO: Job prep_samples (qr1hi-8i9sb-ybwwhtno6skalm6) is Running
2016-03-03 13:21:34 arvados.cwl-runner[16205] INFO: Job prep_samples_2 (qr1hi-8i9sb-ok0a411fz8a6jb6) is Running
2016-03-03 13:37:58 arvados.cwl-runner[16205] INFO: Job prep_align_inputs (qr1hi-8i9sb-pmq750nzfvv8u6q) is Complete
2016-03-03 13:38:00 arvados.arv-run[16205] INFO: Upload local files: "GenomeIndexHash" "OverflowTable" "GenomeIndex" "Genome"
2016-03-03 13:38:01 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-tafox4zrfoudzf5
2016-03-03 13:38:08 arvados.cwl-runner[16205] INFO: Job process_alignment (qr1hi-8i9sb-lravvepn82oey3b) is Queued
2016-03-03 13:38:08 arvados.cwl-runner[16205] INFO: Job prep_samples (qr1hi-8i9sb-ybwwhtno6skalm6) is Complete
2016-03-03 13:38:18 arvados.cwl-runner[16205] INFO: Job process_alignment_2 (qr1hi-8i9sb-uqbbwcr2pzgkhum) is Queued
2016-03-03 13:38:18 arvados.cwl-runner[16205] INFO: Job process_alignment (qr1hi-8i9sb-lravvepn82oey3b) is Running
2016-03-03 13:38:21 arvados.cwl-runner[16205] INFO: Job process_alignment_2 (qr1hi-8i9sb-uqbbwcr2pzgkhum) is Running
2016-03-03 13:39:51 arvados.cwl-runner[16205] INFO: Job prep_samples_2 (qr1hi-8i9sb-ok0a411fz8a6jb6) is Complete
2016-03-03 13:56:54 arvados.cwl-runner[16205] INFO: Job process_alignment_2 (qr1hi-8i9sb-uqbbwcr2pzgkhum) is Complete
2016-03-03 13:57:03 arvados.cwl-runner[16205] INFO: Job process_alignment (qr1hi-8i9sb-lravvepn82oey3b) is Complete
2016-03-03 13:57:11 arvados.cwl-runner[16205] INFO: Job merge_split_alignments (qr1hi-8i9sb-2ozmr8ifamu1411) is Queued
2016-03-03 13:57:14 arvados.cwl-runner[16205] INFO: Job merge_split_alignments (qr1hi-8i9sb-2ozmr8ifamu1411) is Running
2016-03-03 14:15:51 arvados.cwl-runner[16205] INFO: Job merge_split_alignments (qr1hi-8i9sb-2ozmr8ifamu1411) is Complete
[workflow alignment] outdir is /tmp/user/1001/tmp_81v3E
2016-03-03 14:15:53 arvados.arv-run[16205] INFO: Upload local files: "6_100326_FC6107FAAXX.bam" "6_100326_FC6107FAAXX.bam.bai"
2016-03-03 14:15:55 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-wkyoixarjbs8kln
2016-03-03 14:16:01 arvados.cwl-runner[16205] INFO: Job prep_align_inputs_2 (qr1hi-8i9sb-5s94xja7861299a) is Queued
2016-03-03 14:18:14 arvados.cwl-runner[16205] INFO: Job prep_align_inputs_2 (qr1hi-8i9sb-5s94xja7861299a) is Running
2016-03-03 14:36:48 arvados.cwl-runner[16205] INFO: Job prep_align_inputs_2 (qr1hi-8i9sb-5s94xja7861299a) is Complete
2016-03-03 14:36:52 arvados.arv-run[16205] INFO: Upload local files: "hg19.fa.pac" "hg19.fa.sa" "hg19.fa.amb" "hg19.fa.bwt" "hg19.fa.ann"
2016-03-03 14:36:54 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-75aiaeo6mx46oco
2016-03-03 14:37:00 arvados.cwl-runner[16205] INFO: Job process_alignment_3 (qr1hi-8i9sb-yej6uab91szv87r) is Queued
2016-03-03 14:37:07 arvados.cwl-runner[16205] INFO: Job process_alignment_4 (qr1hi-8i9sb-r38v6m62tu9do2y) is Queued
2016-03-03 14:37:07 arvados.cwl-runner[16205] INFO: Job process_alignment_3 (qr1hi-8i9sb-yej6uab91szv87r) is Running
2016-03-03 14:37:21 arvados.cwl-runner[16205] INFO: Job process_alignment_3 (qr1hi-8i9sb-yej6uab91szv87r) is Complete
2016-03-03 14:37:24 arvados.cwl-runner[16205] INFO: Job process_alignment_4 (qr1hi-8i9sb-r38v6m62tu9do2y) is Running
2016-03-03 14:37:43 arvados.cwl-runner[16205] INFO: Job process_alignment_4 (qr1hi-8i9sb-r38v6m62tu9do2y) is Complete
2016-03-03 14:37:53 arvados.cwl-runner[16205] INFO: Job merge_split_alignments_2 (qr1hi-8i9sb-kk1ha59ov9vcr90) is Queued
2016-03-03 14:39:33 arvados.cwl-runner[16205] INFO: Job merge_split_alignments_2 (qr1hi-8i9sb-kk1ha59ov9vcr90) is Running
2016-03-03 14:58:10 arvados.cwl-runner[16205] INFO: Job merge_split_alignments_2 (qr1hi-8i9sb-kk1ha59ov9vcr90) is Complete
[workflow alignment_2] outdir is /tmp/user/1001/tmpFayze_
2016-03-03 14:58:15 arvados.cwl-runner[16205] ERROR: Workflow cannot make any more progress.
2016-03-03 14:58:15 arvados.cwl-runner[16205] ERROR: Caught unhandled exception, marking pipeline as failed
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 413, in arvExecutor
    raise cwltool.workflow.WorkflowException("Workflow did not return a result.")
WorkflowException: Workflow did not return a result.
```
Updated by Brad Chapman about 8 years ago
Peter;
Thanks for the debugging session yesterday. I re-ran and am getting further than before. Awesome. I fixed the issue with using a directory and also got past the `batch_for_variantcall` stage where you were stuck. Here is an updated pipeline (sorry, not with Arvados tracking yet -- need to get an automated way to do that -- but it does have a VERSION file with the date):
https://s3.amazonaws.com/bcbio/cwl/test_bcbio_cwl.tar.gz
The run I'm on is here:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-wvhnu50v15p6lw6
and I get stuck on the next step, which uses the results from `batch_for_variantcall`. `batch_for_variantcall` returns records, which is the major change here. Either I'm not returning them correctly, or they are not handled correctly at some point. I get this error:
```
Exception on step 'get_parallel_regions'
Exception on step 'variantcall'
2016-03-08 09:13:38 arvados.cwl-runner[3444] ERROR: Caught unhandled exception, marking pipeline as failed
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 384, in arvExecutor
    for runnable in jobiter:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 370, in job
    for w in wj.job(builder.job, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 287, in job
    for newjob in step.iterable:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 243, in try_make_job
    for j in jobs:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 528, in dotproduct_scatter
    for j in process.job(jo, basedir, functools.partial(rc.receive_scatter_output, n), **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 145, in job
    for j in self.step.job(joborder, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 472, in job
    **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 370, in job
    for w in wj.job(builder.job, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 287, in job
    for newjob in step.iterable:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 243, in try_make_job
    for j in jobs:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 145, in job
    for j in self.step.job(joborder, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 472, in job
    **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/draft2tool.py", line 166, in job
    builder.pathmapper = self.makePathMapper(reffiles, input_basedir, **kwargs)
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 285, in makePathMapper
    return ArvPathMapper(self.arvrunner, reffiles, input_basedir, **kwargs)
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 252, in __init__
    raise cwltool.workflow.WorkflowException("Input file path '%s' is invalid" % st)
WorkflowException: Input file path '/home/chapmanb/bio/bcbio-nextgen/tests/test_bcbio_cwl/run_info-cwl-workflow/e' is invalid
```
If I add some debugging to print out `builder.files` and `reffiles` in cwltool/draft2tool (https://github.com/common-workflow-language/cwltool/blob/893de464b3dd6f359e39ee00d8d4f69cfdb1e1bc/cwltool/draft2tool.py#L161), I see a lot of `e`'s. The file structure looks to have remained in place, but all the paths were converted to `e`:
```
set([u'e'])
[{u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e', u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'}, {u'class': u'File', u'path': u'e'}]
```
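One way paths can collapse to single characters like this in Python (purely an illustrative guess at the class of bug, not a confirmed diagnosis of the cwltool/arvados-cwl code) is iterating over a bare string where a list of path strings was expected, since iterating a string yields its individual characters:

```python
# Hypothetical sketch: code expects a list of paths, e.g. ["file.bam"],
# but a bare string slips through from a record field.
paths = "file.bam"   # should have been ["file.bam"]

# Iterating a string yields single characters, not paths.
seen = [p for p in paths]
print(seen)  # ['f', 'i', 'l', 'e', '.', 'b', 'a', 'm']

# Each one-character "path" then gets joined onto the workflow base
# directory, producing invalid inputs like '.../run_info-cwl-workflow/e'.
import os
print(os.path.join("run_info-cwl-workflow", seen[3]))  # run_info-cwl-workflow/e
```

If something like this is happening inside the record handling, it would explain both the `set([u'e'])` in `reffiles` and the invalid input path ending in `/e`.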
Updated by Brad Chapman about 8 years ago
Peter;
We have a successful run. Awesome work:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-3q50swaspsu3mn8
The two issues I'm still running into:
- I got the `ERROR: Workflow cannot make any more progress` failure two additional times when running the eventually successful pipeline. It was the same CWL, same Docker image, same cwl-runner/cwltool; I was just literally restarting right after. For the successful run it failed twice and then worked fine the third time. So there is some kind of issue still lurking, but I don't know how to reproduce it other than by running a few times. Unfortunately, the third time, when I ran with --debug to get more information, it finished successfully, so I don't have additional details.
- Re-use of previously run pipelines works intermittently as well. For the two re-runs above, some tasks re-used (immediately returned Completed on the command line) while others did not. I didn't notice an obvious pattern, but maybe there is an easier way than interpreting the logs to check what re-used and what didn't. These were again runs with the same CWL and Docker image, and the Docker image includes the sort_keys=True fixes for JSON dumping from bcbio.
Thanks for all the support on this. Excited to have a working run.
Updated by Brad Chapman about 8 years ago
Peter;
For help with debugging, here are two pipelines run with the same CWL, bcbio image, and cwl-runner. The first failed with the `ERROR: Workflow cannot make any more progress` issue:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-y1z1nxb2enndigu
The second, run immediately after, failed with the same error and also only re-used some components:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-qmxnsnfa6l5fuu9
8 of the 10 steps were re-run. It looks like prep_align_inputs and prep_align_inputs_2 were reused, but not the others.
Updated by Brad Chapman about 8 years ago
- Status changed from New to Resolved
- Remaining (hours) set to 0.0
The bcbio/CWL/Arvados process has moved to uploading inputs and reference data:
https://github.com/chapmanb/bcbio-nextgen/tree/master/cwl#running-bcbio-cwl-on-arvados
This resolves the re-use problems, and I haven't seen the `ERROR: Workflow cannot make any more progress` problem that initiated this thread. Recent runs work cleanly:
https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0fkncxo7asjw3jh
Thanks for all the help with the process.