Bug #11495

bcbio NA12878 validation runs: job re-use failure with non-existent collection

Added by Brad Chapman 2 months ago. Updated 2 months ago.

Status: New
Start date: 04/13/2017
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Story points: -
Velocity based estimate: -

Description

Due to the failures reported in #11494, I tried to re-run the bcbio CWL validation pipeline with re-use enabled and got a failure accessing one of the collections:
```
2017-04-13 09:16:39 arvados.cwl-runner INFO: Pipeline instance qr1hi-d1hrv-7lnm3bklaagipg6
2017-04-13 09:17:02 arvados.cwl-runner INFO: [job prep_samples_to_rec] qr1hi-8i9sb-xv69ktrwccodeja is Queued
2017-04-13 09:17:03 arvados.cwl-runner INFO: [job alignment_to_rec] reused job qr1hi-8i9sb-r83xspm4oxsvm7v
2017-04-13 09:17:08 arvados.cwl-runner ERROR: Got unknown exception while collecting output for job alignment_to_rec:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/arvados_cwl/arvjob.py", line 195, in done
    num_retries=self.arvrunner.num_retries)
  File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 1680, in __init__
    super(CollectionReader, self).__init__(manifest_locator_or_text, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 1236, in __init__
    self._populate()
  File "/usr/lib/python2.7/dist-packages/arvados/collection.py", line 1367, in _populate
    error_via_keep))
NotFoundError: Failed to retrieve collection '9385673b342238f0cd9b7251b725a4ed+85' from either API server (<HttpError 404 when requesting https://qr1hi.arvadosapi.com/arvados/v1/collections/9385673b342238f0cd9b7251b725a4ed%2B85?alt=json returned "Path not found">) or Keep (9385673b342238f0cd9b7251b725a4ed+85 not found: http://keep23.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
; http://keep24.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
; http://keep27.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
; http://keep20.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
; http://keep21.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
; http://keep25.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
; http://keep22.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
; http://keep26.qr1hi.arvadosapi.com:25107/ responded with 403 HTTP/1.1 403 Forbidden
).
2017-04-13 09:17:08 cwltool ERROR: [step alignment_to_rec] Output is missing expected field file:///home/bchapman/runs/NA12878-platinum-chr20-workflow-arvados/main-NA12878-platinum-chr20.cwl#alignment_to_rec/alignment_rec
2017-04-13 09:17:08 cwltool WARNING: [step alignment_to_rec] completed permanentFail
2017-04-13 09:17:08 cwltool INFO: [workflow main-NA12878-platinum-chr20.cwl] outdir is $(task.outdir)
2017-04-13 09:17:08 arvados.cwl-runner WARNING: Overall process status is permanentFail
```
The UUID for the previous run's output is 79e25208a213b812dddda032e28eac07+223:

https://cloud.curoverse.com/jobs/qr1hi-8i9sb-cazntclejrng7dg#Status

so I'm not sure where the collection hash it requests above, which does not exist, comes from.
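
One way to compare the hash in the error against the expected output is to pull every collection hash out of the log text. This is only a minimal sketch, assuming the hashes keep the `<32 hex digits>+<size>` shape seen in the log; `find_collection_hashes` is a hypothetical helper, not part of the Arvados SDK:

```python
import re

# Collection hash as it appears in the log: 32 hex digits, "+", then a size.
# The API URL encodes "+" as "%2B", so both spellings are accepted.
PDH_PATTERN = re.compile(r"\b([0-9a-f]{32})(?:\+|%2B)(\d+)\b")


def find_collection_hashes(log_text):
    """Return the distinct collection hashes in log_text, in order of first appearance."""
    seen = []
    for digest, size in PDH_PATTERN.findall(log_text):
        pdh = "{}+{}".format(digest, size)
        if pdh not in seen:
            seen.append(pdh)
    return seen


# Error line copied from the log in this report (shortened).
error_line = (
    "NotFoundError: Failed to retrieve collection "
    "'9385673b342238f0cd9b7251b725a4ed+85' from either API server"
)
expected_output = "79e25208a213b812dddda032e28eac07+223"

requested = find_collection_hashes(error_line)
print(requested, expected_output in requested)
```

Running this on the full log would show whether the expected output hash is ever requested at all, or only the one that fails.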

History

#1 Updated by Bryan Cosca 2 months ago

It's a missing log file.

2017-04-13 09:17:03 arvados.cwl-runner INFO: [job alignment_to_rec] reused job qr1hi-8i9sb-r83xspm4oxsvm7v
2017-04-13 09:17:08 arvados.cwl-runner ERROR: Got unknown exception while collecting output for job alignment_to_rec:

https://cloud.curoverse.com/jobs/qr1hi-8i9sb-r83xspm4oxsvm7v#Log

Oh... fiddlesticks.

An error occurred when Workbench sent a request to the Arvados API server. Try reloading this page. If the problem is temporary, your request might go through next time. If that doesn't work, the information below can help system administrators track down the problem.

API request URL
https://qr1hi.arvadosapi.com/arvados/v1/collections/9385673b342238f0cd9b7251b725a4ed+85
API response
{
  ":errors":[
    "Path not found" 
  ],
  ":error_token":"1492103219+b7b755a7" 
}
Report problem or email us if you suspect this is a bug.

Weird, though: it should have reused qr1hi-8i9sb-cazntclejrng7dg. I'm not sure if the ID after "reused job" is a typo, since I think that message points at the job it reused.

So I think there's a bug here where the log collection didn't translate into the reused job?
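
A quick way to test this theory would be to check each collection that a job record points at. This is a rough sketch, assuming the job record is a dict whose `output` and `log` fields hold collection hashes, and that `exists` is a caller-supplied lookup (here a stand-in set membership test; a real check would presumably wrap an API request for the collection). The helper name `missing_collections` and the sample record are hypothetical:

```python
def missing_collections(job_record, exists, fields=("output", "log")):
    """Return {field: hash} for each listed field whose collection cannot be found."""
    missing = {}
    for field in fields:
        pdh = job_record.get(field)
        # Skip fields that are unset, e.g. on a job that has not finished yet.
        if pdh and not exists(pdh):
            missing[field] = pdh
    return missing


# Made-up record for illustration: the output hash is a placeholder, and the
# log hash is the one the API server reported as "Path not found".
known = {"0123456789abcdef0123456789abcdef+100"}
job = {
    "output": "0123456789abcdef0123456789abcdef+100",
    "log": "9385673b342238f0cd9b7251b725a4ed+85",
}

print(missing_collections(job, exists=lambda pdh: pdh in known))
```

If only `log` comes back as missing for the reused job, that would support the idea that the output is fine and the failure comes from the log lookup.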

#2 Updated by Brad Chapman 2 months ago

Thanks, Bryan, for digging into this. I didn't intentionally delete any log files, but I'm not sure if the job that failed retrieval, the one without a log, is from a previous run. Previous runs did use a different Docker container, so I think they'll be ignored for re-use (since I didn't have `--ignore-docker-for-reuse` set). I had a few goes at this earlier while fixing bugs in bcbio related to hg38 and samtools variant calling. Not sure if any of that helps, but I'm giving the full background on this project.
