Idea #11065

Updated by Tom Clegg about 7 years ago

h2. Background 

 The logs table currently serves three purposes: 
 * an audit log, permitting admins and users to look up the time and details of past changes to Arvados objects via arvados.v1.logs.* endpoints 
 * a mechanism for passing cache-invalidation events, used by the puma and Go websocket servers, the Python SDK "events" library, and arvados-cwl-runner to detect when an object has changed 
 * a staging area for stdout/stderr text coming from users' jobs/containers, permitting users to see what their jobs/containers are doing while they are still running (i.e., before those text files are written to Keep). 

 Long term plans: 
 * The cache-invalidation mechanism will not rely on the logs table at all. The puma websocket server will be retired. The Go websocket server will use a more efficient event-passing system -- perhaps something like nsq. 
 * Audit logs will be completely optional; will use a better schema that supports search; will shard by time span or use some other approach to prevent unbounded growth; and will be separate from the Arvados object database itself. 

 h2. Problem to address here 

 The logs table grows indefinitely, even on sites where policy does not require an audit log. A huge logs table makes backups, migrations, and upgrades unnecessarily slow and painful. 

 h2. Proposed fix 

 Add an API server config entry establishing the maximum age, in seconds, of rows worth preserving in the logs table. The example value below corresponds to 14 days. 

 <pre><code class="yaml"> 
 max_log_row_age: 1209600 
 </code></pre> 

 In the API server, periodically (every <code class="ruby">(max_log_row_age/14)</code> seconds) delete all log rows older than max_log_row_age seconds. With the example value above, that means one sweep per day. 
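The arithmetic and the delete statement can be sketched in plain Ruby. This is only an illustration, not the API server's code: the helper names are hypothetical, and the sketch assumes the logs table has a @created_at@ timestamp column that is the right one to compare against.

```ruby
# Hypothetical helpers for illustration; none of these names come from Arvados.
MAX_LOG_ROW_AGE = 1_209_600 # seconds (14 days), as in the config example

# Seconds between sweeps: max_log_row_age / 14.
def sweep_interval(max_log_row_age)
  max_log_row_age / 14
end

# Rows stamped earlier than this time are old enough to delete.
def cutoff_time(now, max_log_row_age)
  now - max_log_row_age
end

# Builds the SQL "delete" statement.
# Assumption: the logs table has a created_at column.
def delete_old_logs_sql(now, max_log_row_age)
  cutoff = cutoff_time(now, max_log_row_age).utc.strftime('%Y-%m-%d %H:%M:%S')
  "DELETE FROM logs WHERE created_at < '#{cutoff}'"
end
```

The cutoff string is generated by the server itself rather than taken from user input; inside a Rails app a parameterized query would probably be the more natural way to express the same statement.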

 The SweepTrashedCollections approach might work well here too: after each write transaction (create/update/delete), use the Rails cache to check whether we have deleted old logs in the last N seconds, and if not, delete old logs. 
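A rough sketch of that throttling idea follows. It is not the SweepTrashedCollections code: @MemoryCache@ is a hypothetical in-memory stand-in for the Rails cache so the example runs on its own, and @LogSweeper@ and @sweep_if_due@ are made-up names. The block passed to @LogSweeper.new@ stands in for the actual delete.

```ruby
# Hypothetical stand-in for the Rails cache, so the sketch runs without Rails.
class MemoryCache
  def initialize
    @data = {}
  end

  # Returns the cached value if it has not expired; otherwise runs the
  # block, stores its result for expires_in seconds, and returns it.
  def fetch(key, expires_in:, now: Time.now)
    entry = @data[key]
    return entry[:value] if entry && entry[:expires_at] > now

    value = yield
    @data[key] = { value: value, expires_at: now + expires_in }
    value
  end
end

# Hypothetical sweeper: runs the given block at most once per interval.
class LogSweeper
  def initialize(cache, interval, &delete_old_logs)
    @cache = cache
    @interval = interval
    @delete_old_logs = delete_old_logs
  end

  # Intended to be called after each write transaction. The cache entry
  # acts as a "swept recently" marker; the block only runs once it expires.
  def sweep_if_due(now = Time.now)
    @cache.fetch('LogSweeper', expires_in: @interval, now: now) do
      @delete_old_logs.call
      true
    end
  end
end
```

Because the marker lives in the cache, most write transactions pay only the cost of a cache lookup, and the delete runs roughly once per interval regardless of how busy the server is.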

 Extra notes: 
 * We considered adding max_log_row_count too, but rejected it for three reasons: it would be harder to implement (whereas a maximum age is a simple SQL "delete" statement); it would be harder to choose a sensible value for a given site (the number of logs per second can vary widely); and the functional effects of a time threshold are much easier to reason about. 
