Project

General

Profile

Actions

Idea #11065

closed

[API] Delete rows from logs table when they exceed a configured threshold

Added by Tom Clegg about 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
2.0

Description

Background

The logs table currently serves three purposes:
  • an audit log, permitting admins and users to look up the time and details of past changes to Arvados objects via arvados.v1.logs.* endpoints
  • a mechanism for passing cache-invalidation events, used by the puma and Go websocket servers, the Python SDK "events" library, and arvados-cwl-runner to detect when an object has changed
  • a staging area for stdout/stderr text coming from users' jobs/containers, permitting users to see what their jobs/containers are doing while they are still running (i.e., before those text files are written to Keep).
Long term plans:
  • The cache-invalidation and live job/container-logging systems will not rely on the logs table at all. The puma websocket server will retire. The Go websocket server will use a more efficient event-passing system -- perhaps something like nsq.
  • Recent job/container-logging data will be also accessible some other way -- perhaps a combination of a periodically-flushed log collection and an in-memory buffer.
  • Audit logs will be completely optional; will use a better schema that supports search; will shard by time span or use some other approach to prevent unbounded growth; and will be separate from the Arvados object database itself.

Problem to address here

The logs table grows indefinitely, even on sites where policy does not require an audit log. A huge logs table makes backups, migrations, and upgrades unnecessarily slow and painful.

Proposed fix

Add an API server config entry establishing the maximum time interval worth preserving in the logs table.

# Time to keep audit logs (a row in the log table added each time an
# Arvados object is created, modified, or deleted) in the PostgreSQL
# database. Currently, websocket event notifications rely on audit
# logs, so this should not be set lower than 600 (5 minutes).
max_audit_log_age: 1209600

...and to prevent surprises and avoid bad database behavior -- especially the first time the cleanup job runs on an existing cluster with a huge backlog -- a maximum number of rows to delete in a single transaction.

# Maximum number of log rows to delete in a single SQL transaction.
#
# If max_audit_log_delete_batch is 0, log entries will never be
# deleted by Arvados. Cleanup can be done by an external process
# without affecting any Arvados system processes, as long as very
# recent (<5 minutes old) logs are not deleted.
#
# 100000 is a reasonable batch size for most sites.
max_audit_log_delete_batch: 0
In the API server, periodically (every (max_audit_log_age/14) seconds) delete all log rows older than max_audit_log_age.
  • If max_audit_log_age or max_audit_log_delete_batch is zero, never delete old logs.
  • If max_audit_log_age is non-zero but max_audit_log_delete_batch is zero, which is the default configuration, warn at startup (and, as stated in the previous point, never delete old logs).
  • Only delete logs with event_type ∈ {create, update, destroy, delete} -- don't touch job/container stderr logs (they are handled by the existing "delete job/container logs" rake tasks)
This should be done in a background thread so it doesn't delay the current API response.
  • Similar to SweepTrashedCollections: add a hook after each write transaction (create/update/delete).
  • In a new thread, use the Rails cache to check whether we have already done a "delete old logs" operation in the last N seconds. If so, end the thread.
  • Create a file in Rails.tmp if it doesn't exist already, and use flock() with LOCK_NB to check whether another thread is already running. If flock() fails, end the thread.
  • While holding flock:
    • Execute a DELETE statement with limit=max_audit_log_delete_batch, commit the transaction, and repeat until fewer than max_audit_log_delete_batch rows are deleted.
    • Call sleep(max_audit_log_age/10000) between transactions to avoid keeping PostgreSQL too busy in scenarios where there are many rows to delete.
    • After each transaction, log a notice showing how many rows were deleted.
    • When the last transaction is done, update the "last cleanup" entry in the Rails cache.
  • Drop flock and end the thread.
Extra notes:
  • Considered adding max_log_row_count too, but that would be harder to implement (whereas max age is a simple SQL "delete" statement), it would be harder to choose a sensible value for a given site (the number of logs per second can vary widely), and it's much easier to reason about the functional effects of a time threshold.

Subtasks 1 (0 open1 closed)

Task #11266: Review 11065-rotate-logsResolvedTom Clegg03/23/2017Actions

Related issues

Related to Arvados - Bug #11509: [Keep-web] Support CORS requests with Range headersResolvedTom Clegg04/17/2017Actions
Actions

Also available in: Atom PDF