Bug #15164

Container request not finalized

Added by Peter Amstutz about 2 months ago. Updated 25 days ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 04/30/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

CWL tests sometimes get stuck and then time out.

It does not happen consistently. Experimentally, running tests with -j5 seems to increase the odds of running into it. I've specifically noticed it with the arvbox-based tests running locally and on ci.commonwl.org. I'm not sure if I have observed it on the dev clusters.

What seems to happen is that the underlying container of a container request completes, but the container request is never finalized. The container request remains in the "Committed" state, and its output and logs are not set. As a result, the workflow runner cannot make progress.


Subtasks

Task #15169: Review 15164-cr-finalize-lock (Resolved, Peter Amstutz)

Associated revisions

Revision cf7cdeb3
Added by Peter Amstutz about 1 month ago

Merge branch '15164-cr-finalize-lock' closes #15164

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz about 2 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz about 2 months ago

  • Description updated (diff)

#3 Updated by Peter Amstutz about 2 months ago

Update: current theory is that this is a race condition between Container.handle_completed and container reuse.

  1. (A) Container completion starts
  2. (A) Container gets list of container requests to finalize
  3. (B) New container request is submitted
  4. (B) New container request finds container to reuse
  5. (B) Container appears to still be Running (because update to Complete happened in the other transaction, which hasn't committed yet) so it doesn't finalize
  6. (A) Container completion finalizes container requests found in step 2, which does not include new container request

Result: new container request never gets finalized.
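To make the interleaving concrete, here is a toy replay of the six steps in plain Python. It is only a sketch of the theory above, not the API server code: the classes, the `committed_state` field, and `replay_race` are hypothetical names, and transaction isolation is modelled simply by passing each side the container state it would be able to see.

```python
# Toy model of the suspected race. All names here are hypothetical.

class Container:
    def __init__(self):
        self.committed_state = "Running"  # what other transactions can see
        self.requests = []                # container requests using this container


class ContainerRequest:
    def __init__(self, name):
        self.name = name
        self.state = "Committed"

    def finalize_if_needed(self, visible_container_state):
        # Finalize only if the container looks finished from this transaction.
        if visible_container_state == "Complete":
            self.state = "Final"


def replay_race():
    container = Container()
    old_cr = ContainerRequest("old")
    container.requests.append(old_cr)

    # Steps 1-2 (A): completion starts and snapshots the requests to finalize.
    to_finalize = list(container.requests)

    # Steps 3-4 (B): a new request is submitted and reuses the container.
    new_cr = ContainerRequest("new")
    container.requests.append(new_cr)

    # Step 5 (B): A has not committed yet, so B still sees "Running".
    new_cr.finalize_if_needed(container.committed_state)

    # Step 6 (A): finalize the snapshot from step 2, then commit.
    for cr in to_finalize:
        cr.finalize_if_needed("Complete")
    container.committed_state = "Complete"

    return container.committed_state, old_cr.state, new_cr.state
```

In this replay the container ends up "Complete" and the old request "Final", while the new request stays "Committed", which matches the stuck state described in the bug.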

Proposed solution:

  • Container.handle_completed gets a row level lock on the container, and then on each container request (only when creating a new retry container)
  • ContainerRequest.finalize_if_needed gets a row level lock on the container, then the container request, only then checks state and determines whether to finalize.
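A rough sketch of why this lock ordering closes the window, assuming `threading.Lock` as a stand-in for a database row-level lock. The `Row` class and both functions are hypothetical illustrations, not the actual methods. As a simplification, the completion path here locks every container request, whereas the proposal only does so when creating a retry container.

```python
import threading


class Row:
    """Hypothetical record; self.lock stands in for a row-level lock."""

    def __init__(self, state):
        self.state = state
        self.lock = threading.Lock()


def handle_completed(container, requests, a_holds_lock, b_submitted):
    # Path A: lock the container first, then each container request.
    with container.lock:
        snapshot = list(requests)      # requests known when completion starts
        a_holds_lock.set()
        b_submitted.wait(timeout=1)    # wait until B's new request exists
        container.state = "Complete"
        for cr in snapshot:
            with cr.lock:
                cr.state = "Final"


def finalize_if_needed(container, cr):
    # Path B: same lock order, and the state check happens under the locks.
    with container.lock:
        with cr.lock:
            if container.state == "Complete" and cr.state == "Committed":
                cr.state = "Final"


def run_with_locks():
    container = Row("Running")
    old_cr = Row("Committed")
    requests = [old_cr]
    a_holds_lock = threading.Event()
    b_submitted = threading.Event()

    a = threading.Thread(
        target=handle_completed,
        args=(container, requests, a_holds_lock, b_submitted),
    )
    a.start()
    a_holds_lock.wait()

    # B: a new request arrives while A is in the middle of completion.
    new_cr = Row("Committed")
    requests.append(new_cr)
    b_submitted.set()
    finalize_if_needed(container, new_cr)  # waits if A still holds the lock
    a.join()
    return old_cr.state, new_cr.state
```

Because B cannot inspect the container until A has released its lock, B always sees the "Complete" state and finalizes the new request itself. Taking the locks in the same order on both paths (container, then container request) also avoids a deadlock between the two.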

#4 Updated by Peter Amstutz about 2 months ago

Another variation on this race is that retries may be mishandled: a new container request could finalize on a container that failed and never get a retry container.

#5 Updated by Peter Amstutz about 2 months ago

15164-cr-finalize-lock @ 24301058687be0d42883871d168c15dac98668c2

https://ci.curoverse.com/view/Developer/job/developer-run-tests/1222/

Addresses race condition between container completion and container
reuse. Without this locking, a container request can resolve and
attempt to reuse a container which is concurrently being completed,
resulting in a race condition that results in the container request
never being finalized.

#6 Updated by Peter Amstutz about 2 months ago

  • Target version set to 2019-05-08 Sprint

#7 Updated by Peter Amstutz about 2 months ago

  • Assigned To set to Peter Amstutz

#8 Updated by Lucas Di Pentima about 2 months ago

This LGTM, thanks!

#9 Updated by Peter Amstutz about 1 month ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#10 Updated by Tom Morris 25 days ago

  • Release set to 15
