Idea #16222

Updated by Peter Amstutz almost 5 years ago

Containers send their live logs to the database. Unfortunately, when a very large number of containers are running, this can overwhelm the database and cause the API server to become unresponsive and start returning 503 errors. 

 We need better system behavior and/or a new architecture so that large logging volumes do not cripple the system. Ideally, it should not require extensive tuning the way the current logging parameters do, since that tuning only ever happens *after* a critical failure. 

 This solution should maintain two key features of the current system: 

 * Live logs are delivered to the browser in a reasonable amount of time (latency should be seconds, not minutes) 
 * Logs are stored for long enough that if a compute node running a container fails abruptly, there is a reasonable period during which an admin doing a post-mortem can access the logs leading right up to the point that the compute node went away. 
