Bug #16169

tiling workflow cancelled for unknown reason

Added by Jiayong Li about 1 month ago. Updated 27 days ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 03/02/2020
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

I am running a tiling workflow, but it gets cancelled: https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-mzrysxcgtubgva9

I tried various runtime constraints and workflow parameters, but every run gets cancelled.

Before su92l was upgraded, I ran a workflow of the same scale (input also around 2TB), and it was successful: https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-nm507pzmjqiai4s

Contrasting individual jobs from these two runs, https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-vdlq5f0hqldttso completed but https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-t3dtsqsi3vqfetb was cancelled.


Subtasks

Task #16187: Review 16169-cwl-hints (Resolved, Peter Amstutz)

Associated revisions

Revision c56d0426
Added by Peter Amstutz about 1 month ago

Merge branch '16169-cwl-hints' refs #16169

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Jiayong Li about 1 month ago

I changed "no_listing" from "hints" to "requirements", but it still failed: https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-jqx484v754z4vzl

#2 Updated by Lucas Di Pentima about 1 month ago

  • Target version changed from To Be Groomed to 2020-02-26 Sprint
  • Assigned To set to Lucas Di Pentima
  • Status changed from New to In Progress
  • Category set to Crunch

It seems that the container is getting OOM-killed.

We're also getting a warning in the log:

Warning: cwltool: ../../lib/cwl/workflow.json:1:25668: Recursive directory listing has resulted in a large number of
                                     File objects (1733821) passed to the input parameter 'fjdir'. 
                                     This may negatively affect workflow performance and memory use.

                                     If this is a problem, use the hint
                                     'cwltool:LoadListingRequirement' with "shallow_listing" or
                                     "no_listing" to change the directory listing behavior:

                                     $namespaces:
                                       cwltool: "http://commonwl.org/cwltool#" 
                                     hints:
                                       cwltool:LoadListingRequirement:
                                         loadListing: shallow_listing

...but the workflow already has the no_listing hint, carried over from previous (pre-2.0) successful runs. Maybe this hint is being ignored?

#3 Updated by Jiayong Li about 1 month ago

Specifying "no_listing" on the workflow gets ignored, but specifying "no_listing" at the job level works:
https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-lpvofphpiwrtqan
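
For reference, a minimal sketch of what a job-level hint might look like, assuming "job level" here means the individual CommandLineTool rather than the enclosing Workflow. Only the $namespaces and hints entries are taken from the cwltool warning quoted in #2 (with "no_listing" substituted for "shallow_listing"). The tool itself, including the placeholder command, is hypothetical and is not the actual tiling workflow.

```yaml
# Hypothetical CommandLineTool used only for illustration.
# The $namespaces/hints block mirrors the cwltool warning in this ticket.
cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
hints:
  cwltool:LoadListingRequirement:
    # Do not expand the Directory input into individual File objects.
    loadListing: no_listing
baseCommand: [ls]   # placeholder command (assumption)
inputs:
  fjdir:            # input parameter name taken from the warning
    type: Directory
    inputBinding:
      position: 1
outputs: []
```

With the hint on the tool, the directory passed to 'fjdir' should be handed over without a recursive listing, which is the behavior the workflow-level hint was expected to provide.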

#4 Updated by Peter Amstutz about 1 month ago

  • Target version changed from 2020-02-26 Sprint to 2020-03-11 Sprint

#5 Updated by Peter Amstutz about 1 month ago

  • Assigned To changed from Lucas Di Pentima to Peter Amstutz

#6 Updated by Lucas Di Pentima about 1 month ago

Updates at b12b6c014f0e26fb4c2c2a5ad27a36c3685babf1 - branch 16169-cwl-hints

I was able to reproduce the bug via an a-c-r (arvados-cwl-runner) integration test. I'm handing this off to Peter as I'm a bit stuck, and it would be great to have it done for 2.0.1.

#7 Updated by Peter Amstutz about 1 month ago

  • Release set to 29

#8 Updated by Peter Amstutz about 1 month ago

For some reason, this bug appears when the workflow is submitted with --submit and run in a container, but not when it is run directly on the host with --local.

#10 Updated by Lucas Di Pentima about 1 month ago

This LGTM, thanks!

#11 Updated by Peter Amstutz 30 days ago

  • Status changed from In Progress to Resolved
