Bug #21770
openEvaluation expressions that timeout can raise JSONDecodeError
Description
This is a cwltool bug that arvados-cwl-runner inherits. I have filed it upstream, but I'm duplicating the information here for awareness and tracking.
Expected Behavior
When an evaluation expression exceeds the timeout, the resulting error should be clear that is the cause.
Actual Behavior
At least some of the time, this situation raises a JSONDecodeError
exception. If the user doesn't realize the timeout might be the issue, this can cause them to lose a lot of time trying to debug an issue in the expression itself where none exists.
Full Traceback
This issue was noticed with a "real" expression that similarly built a large array of an array of files, and hit the default evaluation timeout. However, you can reproduce the issue using the attached minimal reproduction and a small evaluation timeout:
% venv/bin/cwltool --debug --eval-timeout=.1 wf.cwl array-in.yml INFO venv/bin/cwltool 3.1.20240404144621 INFO Resolved 'wf.cwl' to 'file:///home/brett/Scratch/wf.cwl' DEBUG [workflow ] initialized from file:///home/brett/Scratch/wf.cwl INFO [workflow ] start DEBUG [workflow ] inputs {} INFO [workflow ] starting step build_array DEBUG [step build_array] job input {} DEBUG [step build_array] evaluated job input to {} INFO [step build_array] start Expecting value: line 1 column 1 (char 0) script was: 01 "use strict"; 02 var inputs = {}; 03 var self = null; 04 var runtime = { 05 "cores": 1, 06 "ram": 256, 07 "tmpdirSize": 1024, 08 "outdirSize": 1024, 09 "tmpdir": null, 10 "outdir": null 11 }; 12 (function(){ 13 var result = []; 14 for (var i=0; i<100000; i++) { 15 var arr = []; 16 result.push(arr); 17 for (var j=0; j<100000; j++) { 18 var num = i * 100000 + j; 19 arr.push({ 20 class: "File", 21 location: "file:///tmp/array.cwl/file" + num.toString() + ".fq", 22 format: "http://edamontology.org/format_1930" 23 }); 24 } 25 } 26 return {"out_arr": result}; 27 })() stdout was: '' stderr was: '' Traceback (most recent call last): File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/sandboxjs.py", line 512, in eval return cast(CWLOutputType, json.loads(stdout)) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/json/__init__.py", line 346, in loads return _default_decoder.decode(s) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) The above exception was the direct cause of the following exception: [Note: Everything from here on is basically propagation of the same root cause exception, but here it is for completeness:] Traceback (most recent call last): File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/expression.py", line 315, in do_eval return interpolate( ^^^^^^^^^^^^ File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/expression.py", line 215, in interpolate e = evaluator( ^^^^^^^^^^ File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/expression.py", line 161, in evaluator return cast(CWLOutputType, js_engine.eval(ex, jslib, **kwargs)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/sandboxjs.py", line 514, in eval raise JavascriptException( cwl_utils.errors.JavascriptException: Expecting value: line 1 column 1 (char 0) script was: 01 "use strict"; 02 var inputs = {}; 03 var self = null; 04 var runtime = { 05 "cores": 1, 06 "ram": 256, 07 "tmpdirSize": 1024, 08 "outdirSize": 1024, 09 "tmpdir": null, 10 "outdir": null 11 }; 12 (function(){ 13 var result = []; 14 for (var i=0; i<100000; i++) { 15 var arr = []; 16 result.push(arr); 17 for (var j=0; j<100000; j++) { 18 var num = i * 100000 + j; 19 arr.push({ 20 class: "File", 21 location: "file:///tmp/array.cwl/file" + num.toString() + ".fq", 22 format: "http://edamontology.org/format_1930" 23 }); 24 } 25 } 26 return {"out_arr": result}; 27 })() stdout was: '' stderr was: '' ERROR Got workflow error [error messages and tracebacks repeat from here]
Diagnosis
This occurs because of a few related issues in the way that the expression evaluation timeout is implemented:
- The evaluator stops reading expression output as soon as the timeout has been reached (i.e., the child process has been sent SIGKILL): https://github.com/common-workflow-language/cwl-utils/blob/0ad6983898f0d9001fe0f416f97c4d8b940e384a/cwl_utils/sandboxjs.py#L252
- Then it unconditionally removes the "process finished str" from the output. This is a bug because if the child process timed out, there's no reason to expect that string to be in the output, so this is just arbitrary truncation: https://github.com/common-workflow-language/cwl-utils/blob/0ad6983898f0d9001fe0f416f97c4d8b940e384a/cwl_utils/sandboxjs.py#L269-L270
- Then it polls the child process a couple times, and determines success based on this. If the child process has not actually received the SIGKILL and ended yet, it's possible
nodejs.poll()
will returnNone
, causing the conditional to assume it succeeded: https://github.com/common-workflow-language/cwl-utils/blob/0ad6983898f0d9001fe0f416f97c4d8b940e384a/cwl_utils/sandboxjs.py#L272-L280
In general, there's a lot of code in exec_js_process
that seems like it could be handled by the subprocess module directly, following the example code documented with !Popen.communicate
. A rough sketch would be something like:
def exec_js_process(
self,
js_text: str,
timeout: float = default_timeout,
...,
) -> Tuple[int, str, str]:
...
with subprocess.Popen(
["nodejs", "--eval", js_text],
stdin=subprocess.DEVNULL,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
encoding='utf-8',
) as nodejs:
try:
stdoutdata, stderrdata = nodejs.communicate(timeout=timeout)
except subprocess.TimeoutExpired:
nodejs.kill()
stdoutdata, stderrdata = nodejs.communicate()
returncode = -1
else:
returncode = nodejs.returncode
return returncode, stdoutdata, stderrdata
This would still need code at the top to find the name of the Node command, handle other arguments, etc. But I think this could replace all the process and timeout management code that currently lives in the method, while addressing the bugs above.
No data to display