Bug #20447

Updated by Peter Amstutz 12 months ago

I still need to collect some evidence but I have a theory: 

 # We put a "big lock" around the containers table: all write operations have to take an exclusive lock (currently this includes container operations that don't update priorities) 
 # This means all container operations now have to wait to get the lock 
 # We also added a feature that updates the "cost" on the API server each time a "running containers probe" happens 
 # This means write operations on containers are now happening much more frequently than just when containers change state 
 # As a result, requests involving containers are getting stacked up, filling up the request queue and making everything slow. 
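
To make the theory concrete, here is a rough sketch that models the exclusive lock as a single queue that serves one write at a time. It is not taken from the actual code: the function name, the 100 ms hold time, and the 500 ms / 50 ms write intervals are all made-up numbers for illustration, not measurements.

```python
def simulate_exclusive_lock(arrivals_ms, hold_ms):
    """Return how long each write waits when all writes share one exclusive lock.

    arrivals_ms: sorted arrival times of write requests, in milliseconds
    hold_ms: how long each request holds the lock (hypothetical value)
    """
    lock_free_at = 0
    waits = []
    for t in arrivals_ms:
        start = max(t, lock_free_at)   # block until the previous holder releases
        waits.append(start - t)
        lock_free_at = start + hold_ms
    return waits


# Assumed scenario 1: writes only on container state changes, one every 500 ms.
state_only = simulate_exclusive_lock(list(range(0, 10_000, 500)), hold_ms=100)

# Assumed scenario 2: probe-driven "cost" updates push writes to one every 50 ms.
with_probe = simulate_exclusive_lock(list(range(0, 10_000, 50)), hold_ms=100)

print("max wait, state changes only:", max(state_only), "ms")
print("max wait, with probe updates:", max(with_probe), "ms")
```

In this toy model, once writes arrive faster than the lock can be released, each request waits longer than the one before it, which is the "stacked up" behavior described in the last point.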

 On the plus side, the dispatcher's behavior of backing off when it sees 500 errors seems to be successfully keeping the system load from spiraling out of control.
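
For reference, the general shape of that kind of back-off is sketched below. This is an assumption, not the dispatcher's actual policy: the exponential schedule, the function name, and the `base`, `factor` and `cap` parameters are all hypothetical.

```python
def backoff_delays(status_codes, base=1.0, factor=2.0, cap=60.0):
    """Return the delay (seconds) a client would wait after each response.

    Hypothetical policy: start at `base` on the first 5xx response, multiply by
    `factor` on each consecutive 5xx up to `cap`, and reset on any other status.
    """
    delay = 0.0
    delays = []
    for code in status_codes:
        if 500 <= code < 600:
            delay = base if delay == 0.0 else min(delay * factor, cap)
        else:
            delay = 0.0  # server recovered, resume normal request rate
        delays.append(delay)
    return delays


print(backoff_delays([200, 500, 500, 500, 200]))
```

The point is only that each consecutive 500 response lowers the request rate, so an overloaded API server gets fewer new requests instead of more.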
