Bug #15164
Container request not finalized (closed)
Description
CWL tests sometimes get stuck and then time out.
It does not happen consistently. Experimentally, running tests with -j5 seems to increase the odds of running into it. I've specifically noticed it with the arvbox-based tests running locally and on ci.commonwl.org. I'm not sure if I have observed it on the dev clusters.
What seems to happen is there is a container request where the underlying container is completed, but the container request is not finalized. The container request remains in "Committed" state, and the output or logs are not set. As a result, the workflow runner cannot make progress.
Updated by Peter Amstutz over 5 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 5 years ago
Update: current theory is that this is a race condition between Container.handle_completed and container reuse.
- (A) Container completion starts
- (A) Container gets list of container requests to finalize
- (B) New container request is submitted
- (B) New container request finds container to reuse
- (B) Container appears to still be Running (because the update to Complete happened in the other transaction, which hasn't committed yet), so the new container request is not finalized
- (A) Container completion finalizes the container requests it listed in step 2; that list does not include the new container request
Result: new container request never gets finalized.
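The interleaving above can be replayed deterministically. This is only a sketch in plain Python, not the Arvados API server code: the classes, the `committed_state` field (standing in for what another database transaction can see), and the state name "Final" for a finalized request are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerRequest:
    name: str
    state: str = "Committed"


@dataclass
class Container:
    # State visible to other transactions (hypothetical stand-in for a committed row).
    committed_state: str = "Running"
    requests: list = field(default_factory=list)


def simulate_race():
    container = Container()
    cr_old = ContainerRequest("old")
    container.requests.append(cr_old)

    # (A) Completion starts; the new state is not yet visible to others.
    pending_state = "Complete"
    # (A) Snapshot the list of container requests to finalize.
    to_finalize = list(container.requests)

    # (B) A new container request is submitted and reuses the container.
    cr_new = ContainerRequest("new")
    container.requests.append(cr_new)
    # (B) It still sees "Running", so it does not finalize itself.
    if container.committed_state == "Complete":
        cr_new.state = "Final"

    # (A) Finalize only the snapshot, then commit the state change.
    for cr in to_finalize:
        cr.state = "Final"
    container.committed_state = pending_state

    return container, cr_old, cr_new
```

After `simulate_race()` the container is Complete and the old request is finalized, but the new request is left in "Committed" with nothing remaining that would finalize it.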
Proposed solution:
- Container.handle_completed gets a row level lock on the container, and then on each container request (only when creating a new retry container)
- ContainerRequest.finalize_if_needed gets a row level lock on the container, then the container request, only then checks state and determines whether to finalize.
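A rough sketch of why taking the container lock first closes the window, again in plain Python rather than the actual Rails models. A `threading.Lock` stands in for the database row-level lock, the function `submit_and_finalize_if_needed` is a hypothetical name, and the separate per-request lock from the proposal is omitted for brevity.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ContainerRequest:
    name: str
    state: str = "Committed"


@dataclass
class Container:
    state: str = "Running"
    requests: list = field(default_factory=list)
    # Stand-in for a row-level lock on the container row (assumption).
    lock: threading.Lock = field(default_factory=threading.Lock)


def handle_completed(container):
    # Lock first, then change state and list the requests to finalize.
    with container.lock:
        container.state = "Complete"
        for cr in container.requests:
            cr.state = "Final"


def submit_and_finalize_if_needed(container, cr):
    # Lock first, and only then check the container state.
    with container.lock:
        container.requests.append(cr)
        if container.state == "Complete":
            cr.state = "Final"


def run_once():
    container = Container(requests=[ContainerRequest("old")])
    cr_new = ContainerRequest("new")
    threads = [
        threading.Thread(target=handle_completed, args=(container,)),
        threading.Thread(target=submit_and_finalize_if_needed,
                         args=(container, cr_new)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [cr.state for cr in container.requests]
```

Whichever thread wins the lock, the new request ends up finalized: if completion runs first, the new request sees Complete when it gets the lock; if the new request runs first, completion finds it in the list.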
Updated by Peter Amstutz over 5 years ago
Another variation on this race is that retries may be mishandled: a new container request could finalize on a container that failed, and not get a retry container.
Updated by Peter Amstutz over 5 years ago
15164-cr-finalize-lock @ 24301058687be0d42883871d168c15dac98668c2
https://ci.curoverse.com/view/Developer/job/developer-run-tests/1222/
Addresses race condition between container completion and container
reuse. Without this locking, a container request can resolve and
attempt to reuse a container which is concurrently being completed,
with the result that the container request is never finalized.
Updated by Peter Amstutz over 5 years ago
- Target version set to 2019-05-08 Sprint
Updated by Peter Amstutz over 5 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|cf7cdeb32bc3596f644ab0871924972abf972290.