Idea #11065

Updated by Tom Clegg about 7 years ago

h2. Background 

 The logs table currently serves three purposes: 
 * an audit log, permitting admins and users to look up the time and details of past changes to Arvados objects via arvados.v1.logs.* endpoints 
 * a mechanism for passing cache-invalidation events, used by the puma and Go websocket servers, the Python SDK "events" library, and arvados-cwl-runner to detect when an object has changed 
 * a staging area for stdout/stderr text coming from users' jobs/containers, permitting users to see what their jobs/containers are doing while they are still running (i.e., before those text files are written to Keep). 

 Long term plans: 
 * The cache-invalidation mechanism will not rely on the logs table at all. The puma websocket server will be retired. The Go websocket server will use a more efficient event-passing system -- perhaps something like nsq.
 * Audit logs will be completely optional; will use a better schema that supports search; will shard by time span or use some other approach to prevent unbounded growth; and will be separate from the Arvados object database itself. 

 h2. Problem to address here 

 The logs table grows indefinitely, even on sites where policy does not require an audit log. A huge logs table makes backups, migrations, and upgrades unnecessarily slow and painful. 

 h2. Proposed fix 

 Add an API server config entry establishing the maximum age of audit log rows worth preserving in the logs table.

 <pre><code class="yaml"> 
 # Time to keep audit logs (a row in the logs table, added each time
 # an Arvados object is created, modified, or deleted) in the
 # PostgreSQL database. Currently, websocket event notifications rely
 # on audit logs, so this should not be set lower than 300 (5 minutes).
 #
 # If max_audit_log_age is 0, log entries will never be deleted by
 # Arvados. Cleanup can be done by an external process without
 # affecting any Arvados system processes, as long as very recent
 # (<5 minutes old) logs are not deleted.
 max_audit_log_age: 1209600
 </code></pre> 

 In the API server, periodically (every <code class="ruby">(max_audit_log_age/14)</code> seconds) delete all log rows older than max_audit_log_age, as sketched after this list.
 * If max_audit_log_age is zero, never delete old logs. 
 * Only delete logs with event_type ∈ {create, update, destroy, delete} -- don't touch job/container stderr logs (they are handled by the existing "delete job/container logs" rake tasks).
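
 The deletion itself could be a single statement. A minimal sketch in ActiveRecord, assuming the config value is exposed as Rails.configuration.max_audit_log_age (illustrative only, not the final implementation):

 <pre><code class="ruby">
 # Sketch: delete audit rows older than the configured maximum age.
 # Job/container stderr logs have other event_type values, so the
 # IN list leaves them alone.
 max_age = Rails.configuration.max_audit_log_age.to_i
 if max_age > 0
   Log.where('event_type in (?) and created_at < ?',
             %w[create update destroy delete],
             Time.now - max_age).delete_all
 end
 </code></pre>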

 This should be done in a background thread so it doesn't delay the current API response. 
 * Similar to SweepTrashedCollections: add a hook after each write transaction (create/update/delete).
 * In a new thread, use the Rails cache to check whether we have already done a "delete old logs" operation in the last N seconds. If so, end the thread.
 * Create a file in Rails.tmp if it doesn't exist already, and use flock() with LOCK_NB to check whether another thread is already running. If flock() fails, end the thread.
 * While holding flock, execute the DELETE statement and update the "last cleanup" entry in the Rails cache.
 * Drop flock and end the thread. (A sketch of this flow follows the list.)
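
 A rough sketch of that flow, assuming a hypothetical Log.tidy_in_background helper invoked from the after-write hook (the cache key and lock file name are made up for illustration):

 <pre><code class="ruby">
 # Hypothetical sketch of the background cleanup described above.
 def self.tidy_in_background
   max_age = Rails.configuration.max_audit_log_age.to_i
   return if max_age <= 0
   Thread.new do
     # Skip if a cleanup already ran in the last max_age/14 seconds.
     last = Rails.cache.read('log_tidy_last_run')
     next if last && last > Time.now - max_age/14
     # flock() with LOCK_NB ensures only one thread/process cleans up.
     File.open(Rails.root.join('tmp', 'log_tidy.lock'),
               File::RDWR | File::CREAT, 0600) do |f|
       next unless f.flock(File::LOCK_EX | File::LOCK_NB)
       Log.where('event_type in (?) and created_at < ?',
                 %w[create update destroy delete],
                 Time.now - max_age).delete_all
       Rails.cache.write('log_tidy_last_run', Time.now)
       # flock is dropped when the file closes at the end of the block.
     end
   end
 end
 </code></pre>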

 Extra notes: 
 * Considered adding max_log_row_count too, but rejected it: a maximum age is a simple SQL "delete" statement, whereas a row-count cap would be harder to implement; it would be harder to choose a sensible value for a given site (the number of logs per second can vary widely); and it's much easier to reason about the functional effects of a time threshold.
