Project

General

Profile

Actions

Bug #9865

closed

[CWL] Fix undefined behavior after ignoring an unhandled exception

Added by Tom Clegg over 7 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Eric Biagiotti
Category:
SDKs
Target version:
Story points:
-
Release relationship:
Auto

Description

Example from the wild

2016-08-26 13:28:37 arvados.cwl-runner[20416] ERROR: While getting final output object: global name 'adjustFiles' is not defined
2016-08-26 13:28:37 arvados.cwl-runner[20416] INFO: Overall process status is success

There's a section here where any exception will be reported but then leave "outputs" in some half-baked state.

https://github.com/curoverse/arvados/blob/master/sdk/cwl/arvados_cwl/runner.py#L131

            try:
                outc = arvados.collection.Collection(record["output"])
                with outc.open("cwl.output.json") as f:
                    outputs = json.load(f)
                def keepify(fileobj):
                    path = fileobj["location"]
                    if not path.startswith("keep:"):
                        fileobj["location"] = "keep:%s/%s" % (record["output"], path)
                adjustFileObjs(outputs, keepify)
                adjustDirObjs(outputs, keepify)
            except Exception as e:
                logger.error("While getting final output object: %s", e)
This code should either:
  • Reset outputs to None in the "except" block; or
  • Make the "try" scope smaller, so once "outputs" isn't None, unexpected exceptions get propagated up.

It should also log the full backtrace for the caught exception.

The "try" block looks like it started accidentally including too much code in c8d9a898cde654b53200bda0b0ef8b406dd71739

Another example from the wild

2019-01-30T04:31:12.159992495Z Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <bound method _fileobject.__del__ of <socket._fileobject object at 0x7f1093013f50>> ignored
2019-01-30T04:31:12.161843551Z arvados.cwl-runner WARNING: Error checking states on API server: maximum recursion depth exceeded while calling a Python object

After this, arvados-cwl-runner stopped producing logs every 3 seconds, and appeared to be deadlocked.

Another example the following day reported a "maximum recursion depth" error, but kept logging "cwltool DEBUG: [workflow workflow.json#main] job step [...] not ready".


Subtasks 1 (0 open1 closed)

Task #14840: Review 9865-cwl-fix-ignored-exceptionsResolvedPeter Amstutz02/25/2019Actions
Actions #1

Updated by Tom Clegg about 5 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg about 5 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg about 5 years ago

The first example was at least partly addressed in b9f7a8693579045f05d142dba8bffd2c5660dfce (it now fails instead of proceeding with undefined results). I'm not sure whether logger here is configured such that logger.exception() prints a stack trace.

Grep reveals some other dubious uses of "except Exception", like this

            except Exception as e:
                logger.info("Setting container output: %s", e)
Actions #4

Updated by Tom Morris about 5 years ago

  • Target version set to 2019-02-27 Sprint
Actions #5

Updated by Tom Morris about 5 years ago

  • Assigned To set to Eric Biagiotti
Actions #6

Updated by Eric Biagiotti about 5 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Eric Biagiotti about 5 years ago

9865-cwl-fix-ignored-exceptions at 6ab33cb91de748f4f8ec8e12377f0b0650804db1

Added stack trace info by using logger.exception where appropriate, cleaned up formatting, added comments clarifying exclusion of stack traces in some instances, and fixed some exception handling. More details in the commit logs.

Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1085
CWL test: https://ci.curoverse.com/view/CWL/job/arvados-cwl-conformance-tests/31/

Actions #8

Updated by Peter Amstutz about 5 years ago

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

                    except Exception as e:
                        logger.warning("Error checking states on API server: %s", e)
                        remain_wait = self.poll_interval
                        continue
Actions #9

Updated by Eric Biagiotti about 5 years ago

Peter Amstutz wrote:

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

[...]

I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.

Actions #10

Updated by Peter Amstutz about 5 years ago

Eric Biagiotti wrote:

Peter Amstutz wrote:

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

[...]

I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.

The entire motivation for this ticket to be put on the sprint was because there's a bug that happens in that specific try/catch block and we were hoping to get a stack trace for it.

Actions #12

Updated by Eric Biagiotti about 5 years ago

Peter Amstutz wrote:

Eric Biagiotti wrote:

Peter Amstutz wrote:

Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1

I think you missed this one in executor.py:364

[...]

I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.

The entire motivation for this ticket to be put on the sprint was because there's a bug that happens in that specific try/catch block and we were hoping to get a stack trace for it.

Thanks for the clarification. Got caught up in the first example and missed the second one in the description. Fixed in 50ecc8767d360670b0f8906da4daa87e56779e54

Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1087/
CWL test: https://ci.curoverse.com/view/CWL/job/arvados-cwl-conformance-tests/34/

Finally got workbench integration tests to pass: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-integration/1127/

Actions #13

Updated by Eric Biagiotti about 5 years ago

  • Target version changed from 2019-02-27 Sprint to 2019-03-13 Sprint
Actions #14

Updated by Peter Amstutz about 5 years ago

Unfortunately error handling is hard to test, and the unit tests don't cover most of these exception cases. I've had exception branches that themselves have bugs that go undetected for months until the one time the error gets thrown, and then instead of logging the error we get a new one instead.

However as a second pair of eyes, I don't see any obvious mistakes, so I think this LGTM.

Actions #15

Updated by Eric Biagiotti about 5 years ago

  • Status changed from In Progress to Resolved
Actions #16

Updated by Tom Morris about 5 years ago

  • Release set to 15
Actions

Also available in: Atom PDF