Bug #16315

closed

Job made input collection unavailable

Added by Sarah Zaranek about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assigned To: -
Category: -
Target version: -
Story points: -

Description

I ran a job that ran out of temp space. (It showed as passed rather than failed, which I find strange, but I stopped the job's next step because I was watching it like a hawk.) When the job finished, it somehow made my input collection (the output of a downloading job I had run previously) unavailable. This is very troubling, since a job is corrupting input in Keep and keeping the user from accessing it. The job itself was not doing anything to the data files except reading them in. It ran out of memory on a final write-out step and/or even in a logging step.

Run that ran out of memory (the bwamem-samtools-view step is the one that ran out of memory):
https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-mcmbdrxoouary4r

Collection that I can no longer access:
su92l-4zz18-m4r33xx05xxbqti (Unavailable)

Job that originally created that collection (note that it says failed only because I messed up the CWL that pulls the final results; the download worked fine, and I just took the data from the steps):

https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-0dti96oardhk6jz

One of the original steps that created the input:
https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-6y4hpvsm5muzw4f

Actions #1

Updated by Peter Amstutz about 4 years ago

1) It looks like the bwa step ran for such a long time that the block signatures expired.

Two things.

(a) You should make sure it fails properly. The reads would have returned I/O errors to bwa, but the step succeeded because by default the shell gives you the exit code of the last program in the pipeline (samtools), not bwa. You need to add something like set -o pipefail ; to the beginning of the shell command.

(b) The block signatures should not have expired so soon; the configuration has them valid for 2 weeks. Also, arv-mount is supposed to refresh the block signatures so this doesn't happen.
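
To illustrate point (a), here is a minimal sketch of how the exit status of a pipeline changes with pipefail. It assumes bash is available. The commands are stand-ins and not the actual workflow step: false plays the role of a failing bwa, and cat plays the role of a samtools that exits cleanly. The variable names are made up for the example.

```shell
#!/usr/bin/env bash
# Stand-ins (assumptions for illustration only):
#   false -> the upstream program that fails (bwa hitting I/O errors)
#   cat   -> the downstream program that exits cleanly (samtools)

# Default behaviour: the pipeline's status is that of the last command,
# so the upstream failure is hidden.
bash -c 'false | cat'
status_default=$?

# With pipefail: the pipeline reports a non-zero status if any command
# in it fails.
bash -c 'set -o pipefail; false | cat'
status_pipefail=$?

echo "default=$status_default pipefail=$status_pipefail"
# prints "default=0 pipefail=1"
```

In a real step, the same idea would mean prefixing the existing shell command with set -o pipefail ; so that a failure in bwa marks the whole step as failed.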

2) The collection is not lost, but it is trashed. According to the audit log, it was created with an expiration date already set. I'm trying to figure out why it did that. In the meantime, it can be untrashed:

https://workbench.su92l.arvadosapi.com/actions?uuid=su92l-4zz18-m4r33xx05xxbqti

Actions #2

Updated by Peter Amstutz about 4 years ago

  • Status changed from New to Resolved

We figured it out. In fact everything is behaving as intended.

The download step had arv:IntermediateOutput set to expire the collection in 24 hours. Because the workflow did not run to completion, the output was never copied to a permanent collection. Using the intermediate collection as input was therefore a ticking time bomb. Bwa ran for nearly 24 hours, and then the block signatures expired, so it started getting read errors. However, it didn't fail, because the shell pipeline used the exit code from the final program in the pipeline (samtools).
