Bug #13500

crunch-dispatch-slurm PG::TRDeadlockDetected: ERROR: deadlock detected

Added by Joshua Randall over 1 year ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-
Release:
Release relationship:
Auto

Description

We have crunch-dispatch-slurm 1.1.4.20180511205416-1

root@arvados-master-eglyx:~# dpkg -l |grep crunch-dis
ii  crunch-dispatch-slurm            1.1.4.20180511205416-1                     amd64        Dispatch Crunch containers to a SLURM cluster

The logs are getting messages like this about 10 times per minute:

May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL:  Process 41291 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 46646.
May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 46646 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 41291.
May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT:  See server log for query details.
May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk
May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:14 Done monitoring container eglyx-dz642-e6zsrf34kxvm0un
May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:19 Error updating container eglyx-dz642-ed9d98nc35cy676 to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR:  deadlock detected
May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL:  Process 41982 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 46644.
May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 46644 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 43414.
May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 43414 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 41982.
May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT:  See server log for query details.
May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk
May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:19 Done monitoring container eglyx-dz642-ed9d98nc35cy676
May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:23 Error updating container eglyx-dz642-su55175dq46ce5n to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR:  deadlock detected
May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL:  Process 46743 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 42203.
May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 42203 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 42197.
May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 42197 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 46743.
May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT:  See server log for query details.
May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk
May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:23 Done monitoring container eglyx-dz642-su55175dq46ce5n
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:34 Error updating container eglyx-dz642-h8w2hpksftit3fx to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR:  deadlock detected
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL:  Process 43623 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 41291.
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 41291 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 29904.
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 29904 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 43623.
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT:  See server log for query details.
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:34 Done monitoring container eglyx-dz642-h8w2hpksftit3fx
May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:34 Error updating container eglyx-dz642-mm5lolu4as26bog to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR:  deadlock detected

This does not seem to be having an adverse effect on performance, but I just wanted to report it as it seems to be new as of the latest version of crunch-dispatch-slurm.


Related issues

Related to Arvados - Bug #13164: [API] dispatch sometimes tries to run cancelled containers at the expense of pending containersResolved05/14/2018

Related to Arvados - Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallelResolved

History

#1 Updated by Joshua Randall over 1 year ago

Spoke too soon - this issue seems to eventually deadlock the entire system.

#2 Updated by Joshua Randall over 1 year ago

I had to revert arvados-api-server to 1.1.4.20180509181910-8 in order to get any workflows to run (using arvados-cwl-runner). With the newer version, these deadlocks rapidly take over nearly all database connections, and no work gets scheduled in SLURM at all.

#3 Updated by Peter Amstutz over 1 year ago

Quoting Josh from https://dev.arvados.org/issues/13164#note-17

These changes seems likely to be the cause of #13500 -- reverting arvados-api-server to the version prior to this revision resolved the issue. With these locking changes, our system was not able to schedule any container work whatsoever. Any attempts to run workflows using arvados-cwl-runner results in the database filling with waiting locks and eventually it starts reporting deadlocks. No jobs were submitted to SLURM at all.

The original description blames crunch-dispatch-slurm, but I think that's misleading. The probable cause is the API server changes in 2b74b0ddd3068f5d336ff1f86efefab2da3a587b

Follow up will be on that ticket.

#4 Updated by Peter Amstutz over 1 year ago

  • Related to Bug #13164: [API] dispatch sometimes tries to run cancelled containers at the expense of pending containers added

#5 Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress

Aside: I suspect the reason these are being cancelled is that the containers#lock API calls failed earlier from deadlock.

#6 Updated by Peter Amstutz over 1 year ago

Josh, if you are feeling adventurous, a bug fix for this problem is committed in 1.1.4.20180518191053

#7 Updated by Ward Vandewege over 1 year ago

  • Status changed from In Progress to Feedback
  • Assigned To set to Tom Clegg
  • Target version set to 2018-06-06 Sprint

#8 Updated by Tom Morris about 1 year ago

  • Status changed from Feedback to Closed

#9 Updated by Ward Vandewege about 1 year ago

  • Related to Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallel added

#10 Updated by Tom Morris about 1 year ago

  • Release set to 13

Also available in: Atom PDF