Bug #9865
closed[CWL] Fix undefined behavior after ignoring an unhandled exception
Description
Example from the wild¶
2016-08-26 13:28:37 arvados.cwl-runner[20416] ERROR: While getting final output object: global name 'adjustFiles' is not defined 2016-08-26 13:28:37 arvados.cwl-runner[20416] INFO: Overall process status is success
There's a section here where any exception will be reported but then leave "outputs" in some half-baked state.
https://github.com/curoverse/arvados/blob/master/sdk/cwl/arvados_cwl/runner.py#L131
try:
outc = arvados.collection.Collection(record["output"])
with outc.open("cwl.output.json") as f:
outputs = json.load(f)
def keepify(fileobj):
path = fileobj["location"]
if not path.startswith("keep:"):
fileobj["location"] = "keep:%s/%s" % (record["output"], path)
adjustFileObjs(outputs, keepify)
adjustDirObjs(outputs, keepify)
except Exception as e:
logger.error("While getting final output object: %s", e)
This code should either:
- Reset outputs to None in the "except" block; or
- Make the "try" scope smaller, so once "outputs" isn't None, unexpected exceptions get propagated up.
It should also log the full backtrace for the caught exception.
The "try" block looks like it started accidentally including too much code in c8d9a898cde654b53200bda0b0ef8b406dd71739
Another example from the wild¶
2019-01-30T04:31:12.159992495Z Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <bound method _fileobject.__del__ of <socket._fileobject object at 0x7f1093013f50>> ignored 2019-01-30T04:31:12.161843551Z arvados.cwl-runner WARNING: Error checking states on API server: maximum recursion depth exceeded while calling a Python object
After this, arvados-cwl-runner stopped producing logs every 3 seconds, and appeared to be deadlocked.
Another example the following day reported a "maximum recursion depth" error, but kept logging "cwltool DEBUG: [workflow workflow.json#main] job step [...] not ready".
Updated by Tom Clegg almost 6 years ago
The first example was at least partly addressed in b9f7a8693579045f05d142dba8bffd2c5660dfce (it now fails instead of proceeding with undefined results). I'm not sure whether logger
here is configured such that logger.exception()
prints a stack trace.
Grep reveals some other dubious uses of "except Exception", like this
except Exception as e:
logger.info("Setting container output: %s", e)
Updated by Tom Morris almost 6 years ago
- Target version set to 2019-02-27 Sprint
Updated by Eric Biagiotti almost 6 years ago
- Status changed from New to In Progress
Updated by Eric Biagiotti over 5 years ago
9865-cwl-fix-ignored-exceptions at 6ab33cb91de748f4f8ec8e12377f0b0650804db1
Added stack trace info by using logger.exception
where appropriate, cleaned up formatting, added comments clarifying exclusion of stack traces in some instances, and fixed some exception handling. More details in the commit logs.
Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1085
CWL test: https://ci.curoverse.com/view/CWL/job/arvados-cwl-conformance-tests/31/
Updated by Peter Amstutz over 5 years ago
Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1
I think you missed this one in executor.py:364
except Exception as e: logger.warning("Error checking states on API server: %s", e) remain_wait = self.poll_interval continue
Updated by Eric Biagiotti over 5 years ago
Peter Amstutz wrote:
Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1
I think you missed this one in executor.py:364
[...]
I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.
Updated by Peter Amstutz over 5 years ago
Eric Biagiotti wrote:
Peter Amstutz wrote:
Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1
I think you missed this one in executor.py:364
[...]
I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.
The entire motivation for this ticket to be put on the sprint was because there's a bug that happens in that specific try/catch block and we were hoping to get a stack trace for it.
Updated by Eric Biagiotti over 5 years ago
Peter Amstutz wrote:
Eric Biagiotti wrote:
Peter Amstutz wrote:
Reviewing 9865-cwl-fix-ignored-exceptions @ 6ab33cb91de748f4f8ec8e12377f0b0650804db1
I think you missed this one in executor.py:364
[...]
I left line 364 deliberately. It seemed like a stack trace would be more clutter than helpful here, but I'll defer to you on this if you think it would be helpful.
The entire motivation for this ticket to be put on the sprint was because there's a bug that happens in that specific try/catch block and we were hoping to get a stack trace for it.
Thanks for the clarification. Got caught up in the first example and missed the second one in the description. Fixed in 50ecc8767d360670b0f8906da4daa87e56779e54
Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1087/
CWL test: https://ci.curoverse.com/view/CWL/job/arvados-cwl-conformance-tests/34/
Finally got workbench integration tests to pass: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-integration/1127/
Updated by Eric Biagiotti over 5 years ago
- Target version changed from 2019-02-27 Sprint to 2019-03-13 Sprint
Updated by Peter Amstutz over 5 years ago
Unfortunately error handling is hard to test, and the unit tests don't cover most of these exception cases. I've had exception branches that themselves have bugs that go undetected for months until the one time the error gets thrown, and then instead of logging the error we get a new one instead.
However as a second pair of eyes, I don't see any obvious mistakes, so I think this LGTM.
Updated by Eric Biagiotti over 5 years ago
- Status changed from In Progress to Resolved