Bug #13491
closed
arvbox deadlocks on parallel usage
Description
Running the CWL conformance tests with arvbox takes an hour, which is far too long.
Peter Amstutz suggested running them in parallel mode, but that is producing deadlock errors.
Using the head of the primary branch and running the following commands:
docker pull arvados/arvbox-demo:latest
sdk/cwl/test_with_arvbox.sh --config localdemo --leave-running --junit-xml=/tmp/junit.xml -j4
Note the use of `-j4`, which leads to parallel calls to arvados-cwl-runner.
Test failed: /tmp/cwltest/arv-cwl-containers --compute-checksum --outdir=/tmp/tmpQ_esdz --quiet v1.0/stderr-shortcut.cwl v1.0/empty.json
Test command line with stderr redirection, brief syntax
Returned non-zero
2018-05-16 07:09:21 arvados.cwl-runner ERROR: [container stderr-shortcut.cwl] got error <HttpError 422 when requesting https://localhost:8000/arvados/v1/container_requests?alt=json returned "#<PG::TRDeadlockDetected: ERROR: deadlock detected
DETAIL: Process 12103 waits for ExclusiveLock on relation 16511 of database 16443; blocked by process 6629.
Process 6629 waits for ExclusiveLock on relation 16498 of database 16443; blocked by process 12103.
HINT: See server log for query details.
>">
2018-05-16 07:09:21 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-05-16 07:09:21 cwltool WARNING: Final process status is permanentFail
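To make the DETAIL portion of the error easier to read: each "waits for ... blocked by" sentence is one edge in a wait-for graph, and the two sentences above point at each other, which is what PostgreSQL reports as a deadlock. The following is only an illustrative sketch for reading such a message; the regular expression and the helper names `wait_for_edges` and `find_cycle` are hypothetical and not part of Arvados or PostgreSQL.

```python
import re

# DETAIL text copied from the error message above.
DETAIL = (
    "Process 12103 waits for ExclusiveLock on relation 16511 of database 16443; "
    "blocked by process 6629. "
    "Process 6629 waits for ExclusiveLock on relation 16498 of database 16443; "
    "blocked by process 12103."
)

# Hypothetical pattern matching one "waits for ... blocked by ..." sentence.
EDGE = re.compile(
    r"Process (\d+) waits for (\w+) on relation (\d+) of database \d+; "
    r"blocked by process (\d+)\."
)


def wait_for_edges(detail):
    """Return {waiter_pid: (blocker_pid, relation_oid)} parsed from DETAIL text."""
    return {
        int(waiter): (int(blocker), int(relation))
        for waiter, _mode, relation, blocker in EDGE.findall(detail)
    }


def find_cycle(edges):
    """Follow blocked-by links from each process; return a cycle of PIDs, or []."""
    for start in edges:
        path = [start]
        current = edges[start][0]
        while current in edges and current not in path:
            path.append(current)
            current = edges[current][0]
        if current == start:
            return path
    return []


edges = wait_for_edges(DETAIL)
cycle = find_cycle(edges)
print("wait-for edges:", edges)
print("cycle:", cycle)
```

Here each process holds the lock the other one wants (relations 16511 and 16498), so neither can proceed until PostgreSQL aborts one of the transactions.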
Full log: https://ci.commonwl.org/job/arvados-conformance/836/console
Updated by Ward Vandewege over 6 years ago
- Related to Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallel added
Updated by Ward Vandewege over 6 years ago
- Status changed from New to Resolved
- Assigned To set to Ward Vandewege
- Target version set to 2018-06-20 Sprint
Hi Michael,
I ran into this as well in a different context. We worked on this bug in ticket #13594, and it is now fixed as of version 1.1.4.20180608190512-8 of the arvados-api-server package.
Thanks,
Ward.
Updated by Michael Crusoe over 6 years ago
Great! Shall I update ci.commonwl.org to run the tests in parallel, or when will the fix be incorporated into the arvados/arvbox-demo:latest container?
Updated by Ward Vandewege over 6 years ago
Michael Crusoe wrote:
Great! Shall I update ci.commonwl.org to run the tests in parallel, or when will the fix be incorporated into the arvados/arvbox-demo:latest container?
The fix is already present in the current arvados/arvbox-demo:latest image published on Docker Hub, so we could try it on ci.commonwl.org. Note that at one point the CWL test suite needs twice as many compute nodes as the concurrency given with the -j parameter, at least for small values of -j. So if you want to use -j4, make sure arvbox has 8 compute 'nodes' available.
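As a quick illustration of that sizing rule, here is a minimal sketch that computes how many compute nodes a given -j value would call for. The function name `compute_nodes_needed` and the `factor` parameter are hypothetical, and the factor of 2 is taken from the comment above as a rule of thumb for small -j values, not as a guarantee.

```python
def compute_nodes_needed(jobs, factor=2):
    """Return the number of compute nodes suggested for a given -j value.

    Assumption: the 2x factor comes from the comment above and is described
    there as applying at least for small values of -j.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return factor * jobs


def has_enough_nodes(jobs, available_nodes):
    """Check whether the available node count meets the rule of thumb."""
    return available_nodes >= compute_nodes_needed(jobs)


for j in (1, 2, 4):
    print(f"-j{j}: {compute_nodes_needed(j)} compute nodes")
```

For -j4 this gives 8, matching the figure in the comment.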