Container request not finalized
CWL tests sometimes get stuck and then time out.
It does not happen consistently. Experimentally, running tests with -j5 seems to increase the odds of running into it. I've specifically noticed it with the arvbox-based tests running locally and on ci.commonwl.org. I'm not sure if I have observed it on the dev clusters.
What seems to happen is that there is a container request whose underlying container has completed, but the container request is not finalized. The container request remains in the "Committed" state, and its output and logs are not set. As a result, the workflow runner cannot make progress.
#3 Updated by Peter Amstutz 6 months ago
Update: current theory is that this is a race condition between
Container.handle_completed and container reuse.
- (A) Container completion starts
- (A) Container gets list of container requests to finalize
- (B) New container request is submitted
- (B) New container request finds container to reuse
- (B) Container appears to still be Running (because update to Complete happened in the other transaction, which hasn't committed yet) so it doesn't finalize
- (A) Container completion finalizes container requests found in step 2, which does not include new container request
Result: new container request never gets finalized.
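To make the interleaving above concrete, here is a minimal toy model of it in Python. It is not Arvados code: the `committed` dict standing in for committed database state, the `txn_a` dict standing in for transaction A's uncommitted write, and the request names `cr1`/`cr2` are all assumptions made for illustration.

```python
# Toy model of the suspected race. "committed" is what other transactions
# can see; "txn_a" holds transaction A's write that is not yet committed.
committed = {
    "container_state": "Running",
    "requests": {"cr1": "Committed"},  # hypothetical existing request
}

# (A) Container completion starts; the state change is staged, not committed.
txn_a = {"container_state": "Complete"}
# (A) Container gets the list of container requests to finalize.
to_finalize = list(committed["requests"])

# (B) A new container request is submitted and finds the container to reuse.
committed["requests"]["cr2"] = "Committed"
# (B) B only sees committed data, so the container still looks Running
#     and B does not finalize its own request.
if committed["container_state"] == "Complete":
    committed["requests"]["cr2"] = "Final"

# (A) Finalize the requests listed earlier, then commit the state change.
for uuid in to_finalize:
    committed["requests"][uuid] = "Final"
committed["container_state"] = txn_a["container_state"]

print(committed)
```

In this model the container ends up Complete and `cr1` is Final, but `cr2` stays Committed, which matches the stuck state described in the report.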
- Container.handle_completed gets a row-level lock on the container, and then on each container request (only when creating a new retry container)
- ContainerRequest.finalize_if_needed gets a row-level lock on the container, then on the container request, and only then checks the state and determines whether to finalize.
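Reading the two bullets above as the locking scheme, a rough sketch of why it closes the race follows. This is a toy model under assumptions, not the actual Rails code: a `threading.Lock` stands in for the row-level lock on the container, the per-request locks are left out, and `make_db`, `run`, and the request names are hypothetical. Because both sides take the container lock before reading state, the two critical sections can only run in one order or the other, so the sketch simply tries both orders.

```python
import threading

def make_db():
    # The lock stands in for the row-level lock on the container row.
    return {
        "lock": threading.Lock(),
        "container_state": "Running",
        "requests": {"cr1": "Committed"},  # hypothetical existing request
    }

def handle_completed(db):
    # (A) Lock the container first, then mark it Complete and finalize
    # every request that exists at that point.
    with db["lock"]:
        db["container_state"] = "Complete"
        for uuid in list(db["requests"]):
            db["requests"][uuid] = "Final"

def finalize_if_needed(db, uuid):
    # (B) Lock the container first, and only then check its state.
    with db["lock"]:
        if db["container_state"] == "Complete":
            db["requests"][uuid] = "Final"

def run(order):
    db = make_db()
    db["requests"]["cr2"] = "Committed"  # new request reusing the container
    steps = {
        "A": lambda: handle_completed(db),
        "B": lambda: finalize_if_needed(db, "cr2"),
    }
    for name in order:
        steps[name]()
    return db["requests"]

# The lock serializes A and B, so these are the only two possible orders.
results = {order: run(order) for order in ("AB", "BA")}
print(results)
```

In both orders the new request ends up finalized: if A runs first, B sees the container as Complete and finalizes itself; if B runs first, A's list of requests already includes the new one.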
#5 Updated by Peter Amstutz 6 months ago
15164-cr-finalize-lock @ 24301058687be0d42883871d168c15dac98668c2
Addresses race condition between container completion and container
reuse. Without this locking, a container request can resolve and
attempt to reuse a container that is concurrently being completed,
and the container request is then never finalized.