Bug #9865

[CWL] Fix undefined behavior after ignoring an unhandled exception

Added by Tom Clegg over 3 years ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
02/25/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

Example from the wild

2016-08-26 13:28:37 arvados.cwl-runner[20416] ERROR: While getting final output object: global name 'adjustFiles' is not defined
2016-08-26 13:28:37 arvados.cwl-runner[20416] INFO: Overall process status is success

There's a section here where any exception will be reported but then leave "outputs" in some half-baked state.

https://github.com/curoverse/arvados/blob/master/sdk/cwl/arvados_cwl/runner.py#L131

            try:
                outc = arvados.collection.Collection(record["output"])
                with outc.open("cwl.output.json") as f:
                    outputs = json.load(f)
                def keepify(fileobj):
                    path = fileobj["location"]
                    if not path.startswith("keep:"):
                        fileobj["location"] = "keep:%s/%s" % (record["output"], path)
                adjustFileObjs(outputs, keepify)
                adjustDirObjs(outputs, keepify)
            except Exception as e:
                logger.error("While getting final output object: %s", e)
This code should either:
  • Reset outputs to None in the "except" block; or
  • Make the "try" scope smaller, so once "outputs" isn't None, unexpected exceptions get propagated up.

It should also log the full backtrace for the caught exception.

The "try" block looks like it started accidentally including too much code in c8d9a898cde654b53200bda0b0ef8b406dd71739

Another example from the wild

2019-01-30T04:31:12.159992495Z Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <bound method _fileobject.__del__ of <socket._fileobject object at 0x7f1093013f50>> ignored
2019-01-30T04:31:12.161843551Z arvados.cwl-runner WARNING: Error checking states on API server: maximum recursion depth exceeded while calling a Python object

After this, arvados-cwl-runner stopped producing logs every 3 seconds, and appeared to be deadlocked.

Another example the following day reported a "maximum recursion depth" error, but kept logging "cwltool DEBUG: [workflow workflow.json#main] job step [...] not ready".


Subtasks

Task #14840: Review 9865-cwl-fix-ignored-exceptionsResolvedPeter Amstutz

Associated revisions

Revision e231fef3
Added by Eric Biagiotti 10 months ago

Merge branch '9865-cwl-fix-ignored-exceptions'

refs #9865

Arvados-DCO-1.1-Signed-off-by: Eric Biagiotti <>

History

#1 Updated by Tom Clegg 11 months ago

  • Description updated (diff)

#2 Updated by Tom Clegg 10 months ago

  • Description updated (diff)

#3 Updated by Tom Clegg 10 months ago

The first example was at least partly addressed in b9f7a8693579045f05d142dba8bffd2c5660dfce (it now fails instead of proceeding with undefined results). I'm not sure whether logger here is configured such that logger.exception() prints a stack trace.

Grep reveals some other dubious uses of "except Exception", like this

            except Exception as e:
                logger.info("Setting container output: %s", e)

#4 Updated by Tom Morris 10 months ago

  • Target version set to 2019-02-27 Sprint

#5 Updated by Tom Morris 10 months ago

  • Assigned To set to Eric Biagiotti

#6 Updated by Eric Biagiotti 10 months ago

  • Status changed from New to In Progress

#7 Updated by Eric Biagiotti 10 months ago

9865-cwl-fix-ignored-exceptions at 6ab33cb91de748f4f8ec8e12377f0b0650804db1

Added stack trace info by using logger.exception where appropriate, cleaned up formatting, added comments clarifying exclusion of stack traces in some instances, and fixed some exception handling. More details in the commit logs.

Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1085
CWL test: https://ci.curoverse.com/view/CWL/job/arvados-cwl-conformance-tests/31/

#8 Updated by Peter Amstutz 10 months ago

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

                    except Exception as e:
                        logger.warning("Error checking states on API server: %s", e)
                        remain_wait = self.poll_interval
                        continue

#9 Updated by Eric Biagiotti 10 months ago

Peter Amstutz wrote:

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

[...]

I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.

#10 Updated by Peter Amstutz 10 months ago

Eric Biagiotti wrote:

Peter Amstutz wrote:

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

[...]

I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.

The entire motivation for this ticket to be put on the sprint was because there's a bug that happens in that specific try/catch block and we were hoping to get a stack trace for it.

#12 Updated by Eric Biagiotti 10 months ago

Peter Amstutz wrote:

Eric Biagiotti wrote:

Peter Amstutz wrote:

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

[...]

I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.

The entire motivation for this ticket to be put on the sprint was because there's a bug that happens in that specific try/catch block and we were hoping to get a stack trace for it.

Thanks for the clarification. Got caught up in the first example and missed the second one in the description. Fixed in 50ecc8767d360670b0f8906da4daa87e56779e54

Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1087/
CWL test: https://ci.curoverse.com/view/CWL/job/arvados-cwl-conformance-tests/34/

Finally got workbench integration tests to pass: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-integration/1127/

#13 Updated by Eric Biagiotti 10 months ago

  • Target version changed from 2019-02-27 Sprint to 2019-03-13 Sprint

#14 Updated by Peter Amstutz 10 months ago

Unfortunately error handling is hard to test, and the unit tests don't cover most of these exception cases. I've had exception branches that themselves have bugs that go undetected for months until the one time the error gets thrown, and then instead of logging the error we get a new one instead.

However as a second pair of eyes, I don't see any obvious mistakes, so I think this LGTM.

#15 Updated by Eric Biagiotti 10 months ago

  • Status changed from In Progress to Resolved

#16 Updated by Tom Morris 10 months ago

  • Release set to 15

Also available in: Atom PDF