Bug #13500
closedcrunch-dispatch-slurm PG::TRDeadlockDetected: ERROR: deadlock detected
Description
We have crunch-dispatch-slurm 1.1.4.20180511205416-1
root@arvados-master-eglyx:~# dpkg -l |grep crunch-dis ii crunch-dispatch-slurm 1.1.4.20180511205416-1 amd64 Dispatch Crunch containers to a SLURM cluster
The logs are getting messages like this about 10 times per minute:
May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL: Process 41291 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 46646. May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 46646 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 41291. May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT: See server log for query details. May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk May 17 15:43:14 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:14 Done monitoring container eglyx-dz642-e6zsrf34kxvm0un May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:19 Error updating container eglyx-dz642-ed9d98nc35cy676 to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR: deadlock detected May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL: Process 41982 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 46644. May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 46644 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 43414. May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 43414 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 41982. May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT: See server log for query details. May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk May 17 15:43:19 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:19 Done monitoring container eglyx-dz642-ed9d98nc35cy676 May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:23 Error updating container eglyx-dz642-su55175dq46ce5n to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR: deadlock detected May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL: Process 46743 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 42203. May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 42203 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 42197. May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 42197 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 46743. May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT: See server log for query details. May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk May 17 15:43:23 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:23 Done monitoring container eglyx-dz642-su55175dq46ce5n May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:34 Error updating container eglyx-dz642-h8w2hpksftit3fx to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR: deadlock detected May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: DETAIL: Process 43623 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 41291. May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 41291 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 29904. May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: Process 29904 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 43623. May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: HINT: See server log for query details. May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: > (422: 422 Unprocessable Entity) returned by arvados-api-eglyx.hgi.sanger.ac.uk May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:34 Done monitoring container eglyx-dz642-h8w2hpksftit3fx May 17 15:43:34 arvados-master-eglyx crunch-dispatch-slurm[32079]: 2018/05/17 15:43:34 Error updating container eglyx-dz642-mm5lolu4as26bog to state "Cancelled": arvados API server error: #<PG::TRDeadlockDetected: ERROR: deadlock detected
This does not seem to be having an adverse effect on performance, but I just wanted to report it as it seems to be new as of the latest version of crunch-dispatch-slurm.
Related issues
Updated by Joshua Randall over 6 years ago
Spoke too soon - this issue seems to eventually deadlock the entire system.
Updated by Joshua Randall over 6 years ago
I had to revert arvados-api-server to 1.1.4.20180509181910-8 in order to get any workflows to run (using arvados-cwl-runner). With the newer version, these deadlocks rapidly take over nearly all database connections, and no work gets scheduled in SLURM at all.
Updated by Peter Amstutz over 6 years ago
Quoting Josh from https://dev.arvados.org/issues/13164#note-17
These changes seems likely to be the cause of #13500 -- reverting arvados-api-server to the version prior to this revision resolved the issue. With these locking changes, our system was not able to schedule any container work whatsoever. Any attempts to run workflows using arvados-cwl-runner results in the database filling with waiting locks and eventually it starts reporting deadlocks. No jobs were submitted to SLURM at all.
The original description blames crunch-dispatch-slurm, but I think that's misleading. The probable cause is the API server changes in 2b74b0ddd3068f5d336ff1f86efefab2da3a587b
Follow up will be on that ticket.
Updated by Peter Amstutz over 6 years ago
- Related to Bug #13164: [API] dispatch sometimes tries to run cancelled containers at the expense of pending containers added
Updated by Tom Clegg over 6 years ago
- Status changed from New to In Progress
Aside: I suspect the reason these are being cancelled is that the containers#lock API calls failed earlier from deadlock.
Updated by Peter Amstutz over 6 years ago
Josh, if you are feeling adventurous, a bug fix for this problem is committed in 1.1.4.20180518191053
Updated by Ward Vandewege over 6 years ago
- Status changed from In Progress to Feedback
- Assigned To set to Tom Clegg
- Target version set to 2018-06-06 Sprint
Updated by Ward Vandewege over 6 years ago
- Related to Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallel added