Project

General

Profile

Actions

Task #8636

closed

bcbio CWL variant call run problem: workflow cannot make any more progress

Added by Brad Chapman about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Target version:
-

Description

Peter;
I'm running into an issue running the more complex bcbio workflow with variant calling. It completes several steps (alignment plus merging) then errors out with:

2016-03-03 14:58:15 arvados.cwl-runner[16205] ERROR: Workflow cannot make any more progress.

This is CWL being run. It does run to completion with the latest cwltool using bcbio Docker but that doesn't mean I didn't do anything wrong:

https://s3.amazonaws.com/bcbio/cwl/test_bcbio_cwl.tar.gz

Here's the run on qr1hi (no obvious errors in the logs there):

https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-7myh0kidb1kdsiv

The the full log from the commandline is below, although not much else there.

Thanks for any tips/ideas about what is happening

$ bash run_arvados.sh
/home/chapmanb/install/bio/arvados/sdk/cwl/miniconda/bin/cwl-runner 1.0.20160302134341
2016-03-03 13:19:01 arvados.arv-run[16205] INFO: Upload local files: "7_100326_FC6107FAAXX.bam.bai" "7_100326_FC6107FAAXX.bam" 
2016-03-03 13:19:03 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-oechoi9ypexn4yl
2016-03-03 13:19:10 arvados.cwl-runner[16205] INFO: Job prep_align_inputs (qr1hi-8i9sb-pmq750nzfvv8u6q) is Queued
2016-03-03 13:19:10 arvados.arv-run[16205] INFO: Upload local files: "genomes/hg19/seq/hg19.fa" "genomes/hg19/seq/hg19.dict" "automated/variant_regions-bam.bed" "genomes/hg19/seq/hg19.fa.fai" 
2016-03-03 13:19:11 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-065c7y5qw1vojxw
2016-03-03 13:19:17 arvados.cwl-runner[16205] INFO: Job prep_samples (qr1hi-8i9sb-ybwwhtno6skalm6) is Queued
2016-03-03 13:19:24 arvados.cwl-runner[16205] INFO: Job prep_samples_2 (qr1hi-8i9sb-ok0a411fz8a6jb6) is Queued
2016-03-03 13:19:35 arvados.cwl-runner[16205] INFO: Job prep_align_inputs (qr1hi-8i9sb-pmq750nzfvv8u6q) is Running
2016-03-03 13:19:35 arvados.cwl-runner[16205] INFO: Job prep_samples (qr1hi-8i9sb-ybwwhtno6skalm6) is Running
2016-03-03 13:21:34 arvados.cwl-runner[16205] INFO: Job prep_samples_2 (qr1hi-8i9sb-ok0a411fz8a6jb6) is Running
2016-03-03 13:37:58 arvados.cwl-runner[16205] INFO: Job prep_align_inputs (qr1hi-8i9sb-pmq750nzfvv8u6q) is Complete
2016-03-03 13:38:00 arvados.arv-run[16205] INFO: Upload local files: "GenomeIndexHash" "OverflowTable" "GenomeIndex" "Genome" 
2016-03-03 13:38:01 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-tafox4zrfoudzf5
2016-03-03 13:38:08 arvados.cwl-runner[16205] INFO: Job process_alignment (qr1hi-8i9sb-lravvepn82oey3b) is Queued
2016-03-03 13:38:08 arvados.cwl-runner[16205] INFO: Job prep_samples (qr1hi-8i9sb-ybwwhtno6skalm6) is Complete
2016-03-03 13:38:18 arvados.cwl-runner[16205] INFO: Job process_alignment_2 (qr1hi-8i9sb-uqbbwcr2pzgkhum) is Queued
2016-03-03 13:38:18 arvados.cwl-runner[16205] INFO: Job process_alignment (qr1hi-8i9sb-lravvepn82oey3b) is Running
2016-03-03 13:38:21 arvados.cwl-runner[16205] INFO: Job process_alignment_2 (qr1hi-8i9sb-uqbbwcr2pzgkhum) is Running
2016-03-03 13:39:51 arvados.cwl-runner[16205] INFO: Job prep_samples_2 (qr1hi-8i9sb-ok0a411fz8a6jb6) is Complete

2016-03-03 13:56:54 arvados.cwl-runner[16205] INFO: Job process_alignment_2 (qr1hi-8i9sb-uqbbwcr2pzgkhum) is Complete
2016-03-03 13:57:03 arvados.cwl-runner[16205] INFO: Job process_alignment (qr1hi-8i9sb-lravvepn82oey3b) is Complete
2016-03-03 13:57:11 arvados.cwl-runner[16205] INFO: Job merge_split_alignments (qr1hi-8i9sb-2ozmr8ifamu1411) is Queued
2016-03-03 13:57:14 arvados.cwl-runner[16205] INFO: Job merge_split_alignments (qr1hi-8i9sb-2ozmr8ifamu1411) is Running
2016-03-03 14:15:51 arvados.cwl-runner[16205] INFO: Job merge_split_alignments (qr1hi-8i9sb-2ozmr8ifamu1411) is Complete
[workflow alignment] outdir is /tmp/user/1001/tmp_81v3E
2016-03-03 14:15:53 arvados.arv-run[16205] INFO: Upload local files: "6_100326_FC6107FAAXX.bam" "6_100326_FC6107FAAXX.bam.bai" 
2016-03-03 14:15:55 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-wkyoixarjbs8kln
2016-03-03 14:16:01 arvados.cwl-runner[16205] INFO: Job prep_align_inputs_2 (qr1hi-8i9sb-5s94xja7861299a) is Queued
2016-03-03 14:18:14 arvados.cwl-runner[16205] INFO: Job prep_align_inputs_2 (qr1hi-8i9sb-5s94xja7861299a) is Running
2016-03-03 14:36:48 arvados.cwl-runner[16205] INFO: Job prep_align_inputs_2 (qr1hi-8i9sb-5s94xja7861299a) is Complete
2016-03-03 14:36:52 arvados.arv-run[16205] INFO: Upload local files: "hg19.fa.pac" "hg19.fa.sa" "hg19.fa.amb" "hg19.fa.bwt" "hg19.fa.ann" 
2016-03-03 14:36:54 arvados.arv-run[16205] INFO: Uploaded to qr1hi-4zz18-75aiaeo6mx46oco
2016-03-03 14:37:00 arvados.cwl-runner[16205] INFO: Job process_alignment_3 (qr1hi-8i9sb-yej6uab91szv87r) is Queued
2016-03-03 14:37:07 arvados.cwl-runner[16205] INFO: Job process_alignment_4 (qr1hi-8i9sb-r38v6m62tu9do2y) is Queued
2016-03-03 14:37:07 arvados.cwl-runner[16205] INFO: Job process_alignment_3 (qr1hi-8i9sb-yej6uab91szv87r) is Running
2016-03-03 14:37:21 arvados.cwl-runner[16205] INFO: Job process_alignment_3 (qr1hi-8i9sb-yej6uab91szv87r) is Complete
2016-03-03 14:37:24 arvados.cwl-runner[16205] INFO: Job process_alignment_4 (qr1hi-8i9sb-r38v6m62tu9do2y) is Running
2016-03-03 14:37:43 arvados.cwl-runner[16205] INFO: Job process_alignment_4 (qr1hi-8i9sb-r38v6m62tu9do2y) is Complete
2016-03-03 14:37:53 arvados.cwl-runner[16205] INFO: Job merge_split_alignments_2 (qr1hi-8i9sb-kk1ha59ov9vcr90) is Queued
2016-03-03 14:39:33 arvados.cwl-runner[16205] INFO: Job merge_split_alignments_2 (qr1hi-8i9sb-kk1ha59ov9vcr90) is Running
2016-03-03 14:58:10 arvados.cwl-runner[16205] INFO: Job merge_split_alignments_2 (qr1hi-8i9sb-kk1ha59ov9vcr90) is Complete
[workflow alignment_2] outdir is /tmp/user/1001/tmpFayze_
2016-03-03 14:58:15 arvados.cwl-runner[16205] ERROR: Workflow cannot make any more progress.
2016-03-03 14:58:15 arvados.cwl-runner[16205] ERROR: Caught unhandled exception, marking pipeline as failed
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 413, in arvExecutor
    raise cwltool.workflow.WorkflowException("Workflow did not return a result.")
WorkflowException: Workflow did not return a result.
Actions #1

Updated by Brad Chapman about 8 years ago

Peter;
Thanks for the debugging session yesterday. I re-ran am also getting further than before. Awesome. I fixed the issue with using a directory and also got past the `batch_for_variantcall` stage where you were stuck. Here is an updated pipeline (sorry, not with Arvados tracking yet -- need to get an automated way to do that -- but it does have a VERSION file with the date):

https://s3.amazonaws.com/bcbio/cwl/test_bcbio_cwl.tar.gz

The run I'm on is here:

https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-wvhnu50v15p6lw6

and I get stuck on the next step using the results from `batch_for_variantcall`. batch_for_variantcall returns records which is the major change here. Either I'm not returning them right, or they are not handled correctly at some point. I get this error:

Exception on step 'get_parallel_regions'
Exception on step 'variantcall'
2016-03-08 09:13:38 arvados.cwl-runner[3444] ERROR: Caught unhandled exception, marking pipeline as failed
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 384, in arvExecutor
    for runnable in jobiter:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 370, in job
    for w in wj.job(builder.job, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 287, in job
    for newjob in step.iterable:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 243, in try_make_job
    for j in jobs:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 528, in dotproduct_scatter
    for j in process.job(jo, basedir, functools.partial(rc.receive_scatter_output, n), **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 145, in job
    for j in self.step.job(joborder, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 472, in job
    **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 370, in job
    for w in wj.job(builder.job, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 287, in job
    for newjob in step.iterable:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 243, in try_make_job
    for j in jobs:
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 145, in job
    for j in self.step.job(joborder, basedir, output_callback, **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/workflow.py", line 472, in job
    **kwargs):
  File "build/bdist.linux-x86_64/egg/cwltool/draft2tool.py", line 166, in job
    builder.pathmapper = self.makePathMapper(reffiles, input_basedir, **kwargs)
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 285, in makePathMapper
    return ArvPathMapper(self.arvrunner, reffiles, input_basedir, **kwargs)
  File "build/bdist.linux-x86_64/egg/arvados_cwl/__init__.py", line 252, in __init__
    raise cwltool.workflow.WorkflowException("Input file path '%s' is invalid" % st)
WorkflowException: Input file path '/home/chapmanb/bio/bcbio-nextgen/tests/test_bcbio_cwl/run_info-cwl-workflow/e' is invalid

If I add some debugging to print out `builder.files` and reffiles in cwltool/draft2tool (https://github.com/common-workflow-language/cwltool/blob/893de464b3dd6f359e39ee00d8d4f69cfdb1e1bc/cwltool/draft2tool.py#L161) I see a lot of e's. The file structure looks to have remained in place but all the paths got converted over to e:

set([u'e'])
[{u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'},
                      {u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'},
                      {u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File',
  u'path': u'e',
  u'secondaryFiles': [{u'class': u'File', u'path': u'e'}]},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'},
 {u'class': u'File', u'path': u'e'}]
Actions #2

Updated by Brad Chapman about 8 years ago

Peter;
We have a successful run. Awesome work:

https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-3q50swaspsu3mn8

The two issues I'm still running into:

- I got the ERROR: Workflow cannot make any more progress failure two additional times when running the eventually successful pipeline. It was the same CWL, same docker image, same cwl-runner/cwltool, just literally restarting again right after. For the successful run it failed twice and then worked fine the third time. So there is some kind of issue still lurking but I don't know how to reproduce other than running a few times. Unfortunately the third time when I ran with --debug to get more information it finished successfully so I don't have additional details.

- Re-use of previously run pipelines works intermittently as well. For the two re-runs above some tasks re-used (immediately returned Completed on the command line) which others did not. I didn't notice an obvious pattern, but maybe there is an easier way that interpreting the logs to check what re-used and what didn't. These were again runs with the same CWL and docker image, and the docker image includes the sort_keys=True fixes for JSON dumping from bcbio.

Thanks for all the support on this. Excited to have a working run.

Actions #3

Updated by Brad Chapman about 8 years ago

Peter;
For help with debugging here are two pipelines run with the same CWL, bcbio image and cwl-runner. The first failed with the `ERROR: Workflow cannot make any progress` issue:

https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-y1z1nxb2enndigu#

The second, run immediately after, failed with the same error and also only re-used some components:

https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-qmxnsnfa6l5fuu9#

8 of the 10 steps were re-run. It looks like prep_align_inputs and prep_align_inputs_2 were reused but not the others.

Actions #4

Updated by Brad Chapman about 8 years ago

  • Status changed from New to Resolved
  • Remaining (hours) set to 0.0

bcbio/CWL/Arvados process moved to use uploading of inputs and reference data:

https://github.com/chapmanb/bcbio-nextgen/tree/master/cwl#running-bcbio-cwl-on-arvados

This resolves the re-use problems and I haven't seen the `ERROR: workflow cannot make any more progress` problem that initiated this thread. Recent runs works cleanly:

https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0fkncxo7asjw3jh

Thanks for all the help with the process.

Actions

Also available in: Atom PDF