Bug #14574

Workflow deadlocked

Added by Peter Amstutz 10 days ago. Updated 1 day ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version:
Start date: 12/06/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

When executing an ExpressionTool, the runner doesn't take the workflow execution lock before calling the output callback. This is a problem when multiple ExpressionTool jobs are executing in threads.
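
For illustration only (this is not the arvados-cwl-runner source, and names like workflow_eval_lock are assumptions), the failure mode looks roughly like this: a coordinator thread waits on a condition variable for step outputs, while a callback fired from an ExpressionTool thread updates shared state without taking that lock, so the waiter can miss the update and block indefinitely.

    # Sketch of the hazard only; names and structure are illustrative.
    import threading

    workflow_eval_lock = threading.Condition()   # shared workflow execution lock (assumed name)
    outputs = {}                                 # step outputs recorded by callbacks

    def wait_for_outputs(expected_steps):
        # Coordinator thread: sleep until every expected step has reported.
        with workflow_eval_lock:
            while expected_steps - outputs.keys():
                # Only a notify issued while holding the lock will wake this wait.
                workflow_eval_lock.wait()

    def unlocked_output_callback(step_id, out):
        # The bug described above: the callback mutates shared state without
        # taking workflow_eval_lock (and never notifies), so wait_for_outputs()
        # can sleep forever when ExpressionTool steps run in separate threads.
        outputs[step_id] = out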


Subtasks

Task #14575: Make note in 1.3.0 release notes about bug and workaround (Resolved, Peter Amstutz)

Task #14580: Review 14574-expression-fix (Resolved, Peter Amstutz)

Associated revisions

Revision 45b8d592
Added by Peter Amstutz 9 days ago

Merge branch '14574-thread-count-1' refs #14574

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 5dbc1ae3
Added by Peter Amstutz 7 days ago

Merge branch '14574-expression-fix' refs #14574

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz 10 days ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 10 days ago

  • Description updated (diff)

#5 Updated by Peter Amstutz 10 days ago

  • Assigned To set to Peter Amstutz

#6 Updated by Peter Amstutz 10 days ago

The quick fix is to change the default thread count, but the best option is to fix the underlying problem.

We need to add a note to the 1.3.0 release notes about the bug and its workaround, and try to fix it properly for 1.3.1.

#7 Updated by Peter Amstutz 10 days ago

14574-expression-fix @ 4ad50255921b571d8e7748b4c6c098b53d803183

https://ci.curoverse.com/view/Developer/job/developer-run-tests/997/

Override ExpressionTool with ArvadosExpressionTool and ensure that the output callback is wrapped to take the workflow lock.
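
Roughly, the shape of that change (a sketch under assumptions, not the merged code: the cwltool import path, constructor signature, and the runner's workflow_eval_lock attribute are taken on faith here):

    # Sketch: subclass cwltool's ExpressionTool so the output callback it is
    # given always runs while holding the runner's workflow execution lock.
    from cwltool.command_line_tool import ExpressionTool   # assumed import path

    class ArvadosExpressionTool(ExpressionTool):
        def __init__(self, arvrunner, toolpath_object, loadingContext):
            super().__init__(toolpath_object, loadingContext)
            self.arvrunner = arvrunner        # assumed to expose .workflow_eval_lock

        def job(self, joborder, output_callback, runtimeContext):
            def locked_callback(out, process_status):
                # Acquire the workflow execution lock before reporting outputs,
                # so the update cannot race with other scheduling threads.
                with self.arvrunner.workflow_eval_lock:
                    output_callback(out, process_status)

            # Run the normal ExpressionTool job with the wrapped callback.
            return super().job(joborder, locked_callback, runtimeContext)

Wrapping the callback (rather than locking inside ExpressionTool itself) keeps the change on the arvados-cwl-runner side without patching cwltool.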

#8 Updated by Peter Amstutz 9 days ago

While I'm pretty sure that failing to take the lock in the ExpressionTool output callback is a bug, and it could plausibly cause the behavior being reported, I haven't actually been able to reproduce the reported deadlock, so I can't say definitively that this fixes it.

#9 Updated by Lucas Di Pentima 7 days ago

The locking LGTM. How can we test this? Maybe with the original workflow?

#10 Updated by Peter Amstutz 7 days ago

Lucas Di Pentima wrote:

The locking LGTM. How can we test this? Maybe with the original workflow?

Yeah, I've already tried it with the original workflow; the problem is that I haven't been able to reproduce the bug, so the fix is speculative. There's definitely a race condition that this branch fixes, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.

#11 Updated by Peter Amstutz 6 days ago

Peter Amstutz wrote:

Lucas Di Pentima wrote:

The locking LGTM. How can we test this? Maybe with the original workflow?

Yeah, I've already tried it with the original workflow; the problem is that I haven't been able to reproduce the bug, so the fix is speculative. There's definitely a race condition that this branch fixes, and a race could create the problems we're seeing, but I can't pin it down either way. I can run it a few more times and see what happens.

I re-ran the job (e51c5-xvhdp-g1kjpf3j7zo6ou1) from the original failure report (e51c5-xvhdp-tlnzytroy9m380j). It finished successfully in 2 minutes (all containers were reused).

Running with job reuse isn't exactly the same as running a normal job, so the only other thing I can think of would be to re-run without job reuse, but that's expensive.

#12 Updated by Peter Amstutz 1 day ago

  • Status changed from In Progress to Resolved
