Bug #15164
closed
Container request not finalized
Added by Peter Amstutz almost 6 years ago.
Updated almost 6 years ago.
Release relationship:
Auto
Description
CWL tests sometimes get stuck and then time out.
It does not happen consistently. Experimentally, running tests with -j5 seems to increase the odds of running into it. I've specifically noticed it with the arvbox-based tests running locally and on ci.commonwl.org. I'm not sure if I have observed it on the dev clusters.
What seems to happen is that there is a container request whose underlying container has completed, but the container request is not finalized. The container request remains in the "Committed" state, and its output and logs are not set. As a result, the workflow runner cannot make progress.
- Status changed from New to In Progress
- Description updated (diff)
Update: current theory is that this is a race condition between Container.handle_completed and container reuse.
- (A) Container completion starts
- (A) Container gets list of container requests to finalize
- (B) New container request is submitted
- (B) New container request finds container to reuse
- (B) Container appears to still be Running (because the update to Complete happened in the other transaction, which hasn't committed yet), so the new container request is not finalized
- (A) Container completion finalizes container requests found in step 2, which does not include new container request
Result: new container request never gets finalized.
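For illustration, the interleaving above can be replayed deterministically in a small Python model. This is only a sketch, not the API server code: the `Database` class, the `simulate_race` function, the "Final" state name, and the request names `cr1` and `cr2` are all assumptions made for the example. Transaction isolation is modelled by keeping A's state change in a local variable until A "commits".

```python
from dataclasses import dataclass, field


@dataclass
class ContainerRequest:
    name: str
    state: str = "Committed"


@dataclass
class Database:
    # Holds committed (globally visible) data only. Hypothetical model.
    container_state: str = "Running"
    requests: dict = field(default_factory=dict)


def simulate_race():
    db = Database()
    db.requests["cr1"] = ContainerRequest("cr1")

    # (A) Container completion starts. The state change is not committed
    # yet, so it is visible only inside transaction A.
    a_pending_state = "Complete"
    # (A) Get the list of container requests to finalize.
    a_to_finalize = list(db.requests)

    # (B) A new container request is submitted and reuses the container.
    db.requests["cr2"] = ContainerRequest("cr2")
    # (B) B sees committed data only, so the container still looks Running
    # and the new request is not finalized.
    if db.container_state == "Complete":
        db.requests["cr2"].state = "Final"

    # (A) Finalize the requests found earlier, then commit the state change.
    for name in a_to_finalize:
        db.requests[name].state = "Final"
    db.container_state = a_pending_state

    return db


if __name__ == "__main__":
    db = simulate_race()
    print(db.container_state, {n: r.state for n, r in db.requests.items()})
```

In this model the container ends up Complete and `cr1` is finalized, while `cr2` stays in "Committed", which matches the stuck state described in the report.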
Proposed solution:
- Container.handle_completed gets a row-level lock on the container, and then on each container request (only when creating a new retry container)
- ContainerRequest.finalize_if_needed gets a row-level lock on the container, then the container request, and only then checks state and determines whether to finalize.
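The lock ordering proposed above can be sketched in Python, with `threading.Lock` standing in for a database row-level lock. This is a simplified model under stated assumptions, not the actual fix: the class layout, the `submit_request` and `run_once` helpers, and the "Final" state name are hypothetical, and the sketch locks every container request in `handle_completed` rather than only when a retry container is created. Releasing the in-memory lock also plays the role of committing the transaction before the row lock is released.

```python
import threading


class Container:
    def __init__(self):
        self.lock = threading.Lock()  # stands in for a row-level lock
        self.state = "Running"
        self.requests = []


class ContainerRequest:
    def __init__(self, container):
        self.lock = threading.Lock()  # stands in for a row-level lock
        self.state = "Committed"
        self.container = container

    def finalize_if_needed(self):
        # Lock the container first, then the request, and only then
        # check state and decide whether to finalize.
        with self.container.lock:
            with self.lock:
                if (self.state == "Committed"
                        and self.container.state == "Complete"):
                    self.state = "Final"


def handle_completed(container):
    # Same lock order as finalize_if_needed: container, then each request.
    with container.lock:
        container.state = "Complete"
        for cr in list(container.requests):
            with cr.lock:
                if cr.state == "Committed":
                    cr.state = "Final"


def submit_request(container):
    # Hypothetical helper: attach a new request to a reused container.
    cr = ContainerRequest(container)
    with container.lock:
        container.requests.append(cr)
    cr.finalize_if_needed()
    return cr


def run_once():
    container = Container()
    first = submit_request(container)
    created = []
    t_a = threading.Thread(target=handle_completed, args=(container,))
    t_b = threading.Thread(
        target=lambda: created.append(submit_request(container)))
    t_a.start()
    t_b.start()
    t_a.join()
    t_b.join()
    return container, first, created[0]
```

Because both code paths take the container lock before reading or changing its state, the new request is either already in the list when completion runs, or it observes the Complete state when it checks. Taking the locks in the same order (container, then container request) in both paths also avoids deadlock between them.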
Another variation on this race is that retries may be mishandled: a new container request could finalize on a container that failed, and so never get a retry container.
- Target version set to 2019-05-08 Sprint
- Assigned To set to Peter Amstutz
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100