Story #11065

[API] Delete rows from logs table when they exceed a configured threshold

Added by Tom Clegg 3 months ago. Updated about 1 month ago.

Status:ResolvedStart date:03/23/2017
Priority:NormalDue date:
Assignee:Tom Clegg% Done:

100%

Category:API
Target version:2017-03-29 sprint
Story points2.0Remaining (hours)0.00 hour
Velocity based estimate-

Description

Background

The logs table currently serves three purposes:
  • an audit log, permitting admins and users to look up the time and details of past changes to Arvados objects via arvados.v1.logs.* endpoints
  • a mechanism for passing cache-invalidation events, used by the puma and Go websocket servers, the Python SDK "events" library, and arvados-cwl-runner to detect when an object has changed
  • a staging area for stdout/stderr text coming from users' jobs/containers, permitting users to see what their jobs/containers are doing while they are still running (i.e., before those text files are written to Keep).
Long term plans:
  • The cache-invalidation and live job/container-logging systems will not rely on the logs table at all. The puma websocket server will retire. The Go websocket server will use a more efficient event-passing system -- perhaps something like nsq.
  • Recent job/container-logging data will be also accessible some other way -- perhaps a combination of a periodically-flushed log collection and an in-memory buffer.
  • Audit logs will be completely optional; will use a better schema that supports search; will shard by time span or use some other approach to prevent unbounded growth; and will be separate from the Arvados object database itself.

Problem to address here

The logs table grows indefinitely, even on sites where policy does not require an audit log. A huge logs table makes backups, migrations, and upgrades unnecessarily slow and painful.

Proposed fix

Add an API server config entry establishing the maximum time interval worth preserving in the logs table.

# Time to keep audit logs (a row in the log table added each time an
# Arvados object is created, modified, or deleted) in the PostgreSQL
# database. Currently, websocket event notifications rely on audit
# logs, so this should not be set lower than 600 (5 minutes).
max_audit_log_age: 1209600

...and to prevent surprises and avoid bad database behavior -- especially the first time the cleanup job runs on an existing cluster with a huge backlog -- a maximum number of rows to delete in a single transaction.

# Maximum number of log rows to delete in a single SQL transaction.
#
# If max_audit_log_delete_batch is 0, log entries will never be
# deleted by Arvados. Cleanup can be done by an external process
# without affecting any Arvados system processes, as long as very
# recent (<5 minutes old) logs are not deleted.
#
# 100000 is a reasonable batch size for most sites.
max_audit_log_delete_batch: 0
In the API server, periodically (every (max_audit_log_age/14) seconds) delete all log rows older than max_audit_log_age.
  • If max_audit_log_age or max_audit_log_delete_batch is zero, never delete old logs.
  • If max_audit_log_age is non-zero but max_audit_log_delete_batch is zero, which is the default configuration, warn at startup (and, as stated in the previous point, never delete old logs).
  • Only delete logs with event_type ∈ {create, update, destroy, delete} -- don't touch job/container stderr logs (they are handled by the existing "delete job/container logs" rake tasks)
This should be done in a background thread so it doesn't delay the current API response.
  • Similar to SweepTrashedCollections: add a hook after each write transaction (create/update/delete).
  • In a new thread, use the Rails cache to check whether we have already done a "delete old logs" operation in the last N seconds. If so, end the thread.
  • Create a file in Rails.tmp if it doesn't exist already, and use flock() with LOCK_NB to check whether another thread is already running. If flock() fails, end the thread.
  • While holding flock:
    • Execute a DELETE statement with limit=max_audit_log_delete_batch, commit the transaction, and repeat until fewer than max_audit_log_delete_batch rows are deleted.
    • Call sleep(max_audit_log_age/10000) between transactions to avoid keeping PostgreSQL too busy in scenarios where there are many rows to delete.
    • After each transaction, log a notice showing how many rows were deleted.
    • When the last transaction is done, update the "last cleanup" entry in the Rails cache.
  • Drop flock and end the thread.
Extra notes:
  • Considered adding max_log_row_count too, but that would be harder to implement (whereas max age is a simple SQL "delete" statement), it would be harder to choose a sensible value for a given site (the number of logs per second can vary widely), and it's much easier to reason about the functional effects of a time threshold.

Subtasks

Task #11266: Review 11065-rotate-logsResolvedTom Clegg

Associated revisions

Revision 4600343d
Added by Tom Clegg about 1 month ago

Merge branch '11065-rotate-logs'

closes #11065

History

#1 Updated by Tom Clegg 3 months ago

  • Subject changed from [API] Delete rows from logs table when older than configured threshold to [API] Delete rows from logs table when they exceed a configured threshold
  • Description updated (diff)
  • Category set to API

#2 Updated by Tom Clegg 3 months ago

TBD:
  • What should the default be? 2w?
  • Any special considerations for existing sites that want to keep logs forever, and don't check the docs before upgrading?
  • Any special considerations for the very first "delete old logs" query, which will probably be quite slow on existing sites?
  • Worth using a shorter interval (like threshold/100) or backgrounding the query, to avoid producing inexplicably slow API calls when they happen to land on a garbage-collection threshold? (description updated to include this)

#3 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#4 Updated by Tom Morris 2 months ago

  • Target version set to Arvados Future Sprints

#5 Updated by Tom Morris about 1 month ago

  • Target version changed from Arvados Future Sprints to 2017-04-12 sprint
  • Story points changed from 0.5 to 2.0

#6 Updated by Tom Clegg about 1 month ago

  • Description updated (diff)

#7 Updated by Tom Morris about 1 month ago

  • Tracker changed from Bug to Story

#8 Updated by Tom Morris about 1 month ago

  • Target version changed from 2017-04-12 sprint to 2017-03-29 sprint

#9 Updated by Tom Clegg about 1 month ago

  • Assignee set to Tom Clegg

#10 Updated by Tom Clegg about 1 month ago

  • Status changed from New to In Progress

#12 Updated by Radhika Chippada about 1 month ago

  • “Time to keep audit logs (a row in the log table added … “ => “Time to keep audit logs (a row in the log table is added …”
  • Since we are using the default value from “common” in “test” config also, do we really need this at line 483: max_audit_log_delete_batch: 0
  • I couldn’t find where we are doing Rails.cache.write for ‘AuditLogs' and so I am not clear how this will work: Rails.cache.fetch('AuditLogs', expires_in: exp)
  • Do we need File.owned?(Rails.root.join('tmp')) here? Can it be sufficient if one has write permission and can create the lock file here instead?
  • This test name can probably clarified as “retain old audit logs with default setting of max_audit_log_delete_batch=0”
  • Test ‘retain old audit logs with default settings’ and 'retain old audit logs with batch=0' are basically doing the same, once we clarify the name of the first test …
  • tidy_in_background actually returns if max_audit_log_age configured as zero, but the test 'delete old audit logs in multiple batches' ignores such a restriction by passing it and indeed uses zero, which is confusing. Suggest using a value such as 0.1 instead.
  • The test name delete old audit logs with production settings' is a bit misleading because max_age < 5min is not production-like per the max age config parameter documentation and you might want to use 5min instead in this test.

#13 Updated by Tom Clegg about 1 month ago

Radhika Chippada wrote:

  • “Time to keep audit logs (a row in the log table added … “ => “Time to keep audit logs (a row in the log table is added …”

Reworded.

  • Since we are using the default value from “common” in “test” config also, do we really need this at line 483: max_audit_log_delete_batch: 0

I guess not. I had added to ensure the test suite doesn't auto-delete even if we start enabling it by default. But maybe it's less confusing if we wait and only do that if/when we get there.

  • I couldn’t find where we are doing Rails.cache.write for ‘AuditLogs' and so I am not clear how this will work: Rails.cache.fetch('AuditLogs', expires_in: exp)

The block form of fetch() runs the block and updates the cache with the returned value if the cached value is expired/nonexistent.

  • Do we need File.owned?(Rails.root.join('tmp')) here? Can it be sufficient if one has write permission and can create the lock file here instead?

This addresses the situation where a rake task runs as root. root can write files to the cache directory, but if it does, the www-data won't be able to update/delete them later.

  • This test name can probably clarified as “retain old audit logs with default setting of max_audit_log_delete_batch=0”
  • Test ‘retain old audit logs with default settings’ and 'retain old audit logs with batch=0' are basically doing the same, once we clarify the name of the first test …

Added comments to clarify purpose of separate tests.

  • tidy_in_background actually returns if max_audit_log_age configured as zero, but the test 'delete old audit logs in multiple batches' ignores such a restriction by passing it and indeed uses zero, which is confusing. Suggest using a value such as 0.1 instead.

Updated.

  • The test name delete old audit logs with production settings' is a bit misleading because max_age < 5min is not production-like per the max age config parameter documentation and you might want to use 5min instead in this test.

I just didn't think it was worth changing the test fixtures over it... added a comment to explain the anomaly.

11065-rotate-logs @ 6722b420effcab24dffd9b47fa277f8830bd4cca

#14 Updated by Radhika Chippada about 1 month ago

LGTM

#15 Updated by Tom Clegg about 1 month ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:4600343d1bff7ac4f7b9f08486541444c31af8b6.

Also available in: Atom PDF