Bug #15164

closed

Container request not finalized

Added by Peter Amstutz over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

CWL tests sometimes get stuck and then time out.

It does not happen consistently. Experimentally, running tests with -j5 seems to increase the odds of running into it. I've specifically noticed it with the arvbox-based tests running locally and on ci.commonwl.org. I'm not sure if I have observed it on the dev clusters.

What seems to happen is that there is a container request whose underlying container has completed, but the container request is not finalized. The container request remains in the "Committed" state, and its output and logs are not set. As a result, the workflow runner cannot make progress.


Subtasks 1 (0 open, 1 closed)

Task #15169: Review 15164-cr-finalize-lock (Resolved, Peter Amstutz, 04/30/2019)
Actions #1

Updated by Peter Amstutz over 5 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 5 years ago

Update: current theory is that this is a race condition between Container.handle_completed and container reuse.

  1. (A) Container completion starts
  2. (A) Container gets list of container requests to finalize
  3. (B) New container request is submitted
  4. (B) New container request finds container to reuse
  5. (B) Container appears to still be Running (because update to Complete happened in the other transaction, which hasn't committed yet) so it doesn't finalize
  6. (A) Container completion finalizes container requests found in step 2, which does not include new container request

Result: new container request never gets finalized.
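
The interleaving above can be replayed as a small deterministic sketch. This is a toy model, not Arvados code: the dictionary stands in for committed database state, and `simulate_race`, `cr1`, and `cr2` are hypothetical names chosen for illustration.

```python
# Toy replay of the six steps; transaction A's state change stays invisible
# to B until A commits at the end. All names here are hypothetical.

def simulate_race():
    # Committed (globally visible) state before the race starts.
    committed = {
        "container_state": "Running",
        "requests": {"cr1": "Committed"},
    }

    # (A) Steps 1-2: completion starts; A snapshots the requests to finalize.
    a_pending_state = "Complete"              # not yet committed
    a_to_finalize = list(committed["requests"])

    # (B) Steps 3-4: a new request is submitted and reuses the container.
    committed["requests"]["cr2"] = "Committed"

    # (B) Step 5: B only sees committed state, so the container looks Running
    # and B does not finalize its request.
    if committed["container_state"] == "Complete":
        committed["requests"]["cr2"] = "Final"

    # (A) Step 6: A finalizes only the requests from its snapshot, then commits.
    for uuid in a_to_finalize:
        committed["requests"][uuid] = "Final"
    committed["container_state"] = a_pending_state

    return committed


if __name__ == "__main__":
    print(simulate_race())
```

In this model the container ends up Complete while `cr2` is still Committed, which matches the stuck state described in the report.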

Proposed solution:

  • Container.handle_completed gets a row-level lock on the container, and then on each container request (only when creating a new retry container).
  • ContainerRequest.finalize_if_needed gets a row-level lock on the container, then on the container request, and only then checks state and determines whether to finalize.
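
A minimal sketch of the proposed lock ordering, assuming `threading.Lock` as a stand-in for a database row-level lock. The classes and functions below only borrow the names from the proposal; they are hypothetical and do not reflect the actual API server code. Both paths take the container lock first and the container request lock second, so a late request either lands in the completion's list or observes the Complete state.

```python
import threading


class Request:
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for a row-level lock
        self.state = "Committed"


class Container:
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for a row-level lock
        self.state = "Running"
        self.requests = {}             # uuid -> Request


def handle_completed(container, holding, go):
    # Lock the container first, then each request.
    with container.lock:
        to_finalize = list(container.requests.values())  # snapshot (step 2)
        holding.set()          # tell the demo that A now holds the lock
        go.wait(timeout=5)     # pause so B can arrive mid-completion
        container.state = "Complete"
        for req in to_finalize:
            with req.lock:
                req.state = "Final"


def finalize_if_needed(container, req):
    # Same order: container, then request; check state only under both locks.
    with container.lock:
        with req.lock:
            if container.state == "Complete" and req.state == "Committed":
                req.state = "Final"


def run_demo():
    container = Container()
    container.requests["cr1"] = Request()
    holding, go = threading.Event(), threading.Event()

    a = threading.Thread(target=handle_completed, args=(container, holding, go))
    a.start()
    holding.wait(timeout=5)

    # B: a new request reuses the container while A is mid-completion.
    new_req = Request()
    container.requests["cr2"] = new_req
    b = threading.Thread(target=finalize_if_needed, args=(container, new_req))
    b.start()

    go.set()
    a.join()
    b.join()
    return {uuid: req.state for uuid, req in container.requests.items()}


if __name__ == "__main__":
    print(run_demo())
```

Because B blocks on the container lock until A has released it, B's state check runs after the container is marked Complete, so the new request is finalized instead of being left behind.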
Actions #4

Updated by Peter Amstutz over 5 years ago

Another variation on this race is that retries may be mishandled: a new container request could finalize on a container that failed and never get a retry container.

Actions #5

Updated by Peter Amstutz over 5 years ago

15164-cr-finalize-lock @ 24301058687be0d42883871d168c15dac98668c2

https://ci.curoverse.com/view/Developer/job/developer-run-tests/1222/

Addresses a race condition between container completion and container
reuse. Without this locking, a container request can resolve and
attempt to reuse a container that is concurrently being completed,
with the result that the container request is never finalized.

Actions #6

Updated by Peter Amstutz over 5 years ago

  • Target version set to 2019-05-08 Sprint
Actions #7

Updated by Peter Amstutz over 5 years ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Lucas Di Pentima over 5 years ago

This LGTM, thanks!

Actions #9

Updated by Peter Amstutz over 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
Actions #10

Updated by Tom Morris over 5 years ago

  • Release set to 15