Bug #13491

arvbox deadlocks on parallel usage

Added by Michael Crusoe over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -
Release:
Release relationship: Auto

Description

Running the CWL conformance tests with arvbox takes an hour, which is far too long.
Peter Amstutz suggested running them in parallel mode, but that is producing deadlock errors.

Using the head of the primary branch and running the following commands:

docker pull arvados/arvbox-demo:latest
sdk/cwl/test_with_arvbox.sh --config localdemo --leave-running --junit-xml=/tmp/junit.xml -j4

Note the use of `-j4`, which leads to parallel calls to arvados-cwl-runner.

Test failed: /tmp/cwltest/arv-cwl-containers --compute-checksum --outdir=/tmp/tmpQ_esdz --quiet v1.0/stderr-shortcut.cwl v1.0/empty.json
Test command line with stderr redirection, brief syntax
Returned non-zero
2018-05-16 07:09:21 arvados.cwl-runner ERROR: [container stderr-shortcut.cwl] got error <HttpError 422 when requesting https://localhost:8000/arvados/v1/container_requests?alt=json returned "#<PG::TRDeadlockDetected: ERROR:  deadlock detected
DETAIL:  Process 12103 waits for ExclusiveLock on relation 16511 of database 16443; blocked by process 6629.
Process 6629 waits for ExclusiveLock on relation 16498 of database 16443; blocked by process 12103.
HINT:  See server log for query details.
>">
2018-05-16 07:09:21 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-05-16 07:09:21 cwltool WARNING: Final process status is permanentFail

Full log: https://ci.commonwl.org/job/arvados-conformance/836/console


Related issues

Related to Arvados - Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallel (Resolved)

History

#1 Updated by Ward Vandewege about 1 year ago

  • Related to Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallel added

#2 Updated by Ward Vandewege about 1 year ago

  • Status changed from New to Resolved
  • Assigned To set to Ward Vandewege
  • Target version set to 2018-06-20 Sprint

Hi Michael,

I ran into this as well in a different context. We worked on this bug in ticket #13594, and it is now fixed as of version 1.1.4.20180608190512-8 of the arvados-api-server package.

Thanks,
Ward.

#3 Updated by Michael Crusoe about 1 year ago

Great! Shall I update ci.commonwl.org to run the tests in parallel now, or when will the fix be incorporated into the arvados/arvbox-demo:latest container?

#4 Updated by Ward Vandewege about 1 year ago

The fix is already present in the current version of the arvados/arvbox-demo:latest image published on Docker Hub, so we could try it on ci.commonwl.org. At one point the CWL test suite requires twice as many compute nodes as the concurrency given with the -j parameter, at least for small values of -j. So if you want to use -j4, make sure arvbox has 8 compute 'nodes' available.
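
As a rough illustration of the 2x rule of thumb above, here is a minimal shell sketch. The helper name `required_nodes` is hypothetical and not part of arvbox or the test script; it only does the arithmetic, and it assumes the factor of two holds for the chosen -j value (the comment says this is certain only for small values).

```shell
#!/bin/sh
# Hypothetical helper (not part of arvbox): estimate how many compute
# 'nodes' arvbox should have available for a given -j concurrency,
# assuming the 2x rule of thumb described in the comment above.
required_nodes() {
    n=$1
    echo $((n * 2))
}

JOBS=4
echo "-j${JOBS} needs about $(required_nodes "$JOBS") compute nodes"
```

How the number of compute 'nodes' is actually configured in arvbox is not covered in this ticket, so that step is left out of the sketch.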

#5 Updated by Tom Morris about 1 year ago

  • Release set to 13
