Bug #14574 (closed): Workflow deadlocked

Added by Peter Amstutz over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: -
Story points: -
Release relationship: Auto

Description

ExpressionTool does not take the workflow execution lock when it calls the output callback. This is a problem when multiple ExpressionTool jobs are executing in threads.
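
For illustration, a minimal sketch of the pattern at issue (the names below are hypothetical, not the actual arvados-cwl-runner code): an output callback that updates shared workflow state from worker threads has to hold the workflow lock, otherwise concurrent ExpressionTool jobs can interleave their updates and leave the workflow waiting on a step it believes is still pending.

```python
import threading

# Hypothetical illustration of the race described above; not the actual
# arvados-cwl-runner code.
workflow_lock = threading.Lock()
pending_steps = {"step1", "step2"}   # shared workflow bookkeeping

def output_callback_unsafe(step_id, outputs):
    # Bug pattern: mutates shared workflow state without holding the
    # workflow lock, so two ExpressionTool threads can interleave here.
    pending_steps.discard(step_id)

def output_callback_locked(step_id, outputs):
    # Fixed pattern: take the workflow lock before touching shared state.
    with workflow_lock:
        pending_steps.discard(step_id)
```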


Subtasks 2 (0 open, 2 closed)

Task #14575: Make note in 1.3.0 release notes about bug and workaround (Resolved, Peter Amstutz, 12/06/2018)
Task #14580: Review 14574-expression-fix (Resolved, Peter Amstutz, 12/06/2018)
#1

Updated by Peter Amstutz over 5 years ago

  • Status changed from New to In Progress
#2

Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
#5

Updated by Peter Amstutz over 5 years ago

  • Assigned To set to Peter Amstutz
#6

Updated by Peter Amstutz over 5 years ago

The quick fix is to change the default, but the better option is to fix the underlying problem.

Need to make a note in the 1.3.0 release notes about the bug and its workaround, and try to fix it for 1.3.1.

#7

Updated by Peter Amstutz over 5 years ago

14574-expression-fix @ 4ad50255921b571d8e7748b4c6c098b53d803183

https://ci.curoverse.com/view/Developer/job/developer-run-tests/997/

Override ExpressionTool with ArvadosExpressionTool and ensure that the output callback is wrapped to take the workflow lock.
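
Roughly, the shape of that change is sketched below, assuming cwltool's ExpressionTool.job() entry point. This is only an illustration of the wrapping pattern; the workflow_eval_lock attribute and the fallback lock are assumptions for the example, so see the 14574-expression-fix branch for the actual implementation.

```python
import threading

from cwltool.command_line_tool import ExpressionTool


class ArvadosExpressionTool(ExpressionTool):
    # Sketch only: wrap the output callback so it always runs while
    # holding the workflow lock before touching shared workflow state.

    def job(self, joborder, output_callback, runtimeContext):
        # Assumption: the runner's lock is available on the runtime
        # context; fall back to a local lock for this illustration.
        lock = getattr(runtimeContext, "workflow_eval_lock", None) or threading.Lock()

        def locked_callback(out, processStatus):
            with lock:
                output_callback(out, processStatus)

        # Delegate to the stock ExpressionTool, but with the wrapped callback.
        return super().job(joborder, locked_callback, runtimeContext)
```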

#8

Updated by Peter Amstutz over 5 years ago

While I'm pretty sure that failing to take the lock in the ExpressionTool output callback is a bug, and it could plausibly cause the behavior being reported, I haven't actually been able to reproduce the reported deadlock, so I can't say definitively that this fixes it.

#9

Updated by Lucas Di Pentima over 5 years ago

The locking LGTM. How can we test this? Maybe with the original workflow?

#10

Updated by Peter Amstutz over 5 years ago

Lucas Di Pentima wrote:

> The locking LGTM. How can we test this? Maybe with the original workflow?

Yea, I've already tried it with the original workflow; the problem is I haven't been able to reproduce the bug, so it's speculative. There's definitely a race condition that is fixed by this branch, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.

#11

Updated by Peter Amstutz over 5 years ago

Peter Amstutz wrote:

> Lucas Di Pentima wrote:
>
> > The locking LGTM. How can we test this? Maybe with the original workflow?
>
> Yea, I've already tried it with the original workflow; the problem is I haven't been able to reproduce the bug, so it's speculative. There's definitely a race condition that is fixed by this branch, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.

I re-ran the job (e51c5-xvhdp-g1kjpf3j7zo6ou1) from the original failure report (e51c5-xvhdp-tlnzytroy9m380j). It finished successfully in 2 minutes (all containers reused).

Running with job reuse isn't exactly the same as running a normal job, so the only other thing I can think of would be to re-run without job reuse, but that's expensive.

#12

Updated by Peter Amstutz over 5 years ago

  • Status changed from In Progress to Resolved
#13

Updated by Tom Morris about 5 years ago

  • Release set to 15