Idea #11065: [API] Delete rows from logs table when they exceed a configured threshold - Arvados

Actions

Copy link

Idea #11065

closed

[API] Delete rows from logs table when they exceed a configured threshold

Added by Tom Clegg about 8 years ago. Updated about 8 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Tom Clegg

Category:

API

Target version:

2017-03-29 sprint

Start date:

03/23/2017

Due date:

Story points:

2.0

Description

Background¶

The logs table currently serves three purposes:

an audit log, permitting admins and users to look up the time and details of past changes to Arvados objects via arvados.v1.logs.* endpoints
a mechanism for passing cache-invalidation events, used by the puma and Go websocket servers, the Python SDK "events" library, and arvados-cwl-runner to detect when an object has changed
a staging area for stdout/stderr text coming from users' jobs/containers, permitting users to see what their jobs/containers are doing while they are still running (i.e., before those text files are written to Keep).

Long term plans:

The cache-invalidation and live job/container-logging systems will not rely on the logs table at all. The puma websocket server will retire. The Go websocket server will use a more efficient event-passing system -- perhaps something like nsq.
Recent job/container-logging data will be also accessible some other way -- perhaps a combination of a periodically-flushed log collection and an in-memory buffer.
Audit logs will be completely optional; will use a better schema that supports search; will shard by time span or use some other approach to prevent unbounded growth; and will be separate from the Arvados object database itself.

Problem to address here¶

The logs table grows indefinitely, even on sites where policy does not require an audit log. A huge logs table makes backups, migrations, and upgrades unnecessarily slow and painful.

Proposed fix¶

Add an API server config entry establishing the maximum time interval worth preserving in the logs table.

# Time to keep audit logs (a row in the log table added each time an
# Arvados object is created, modified, or deleted) in the PostgreSQL
# database. Currently, websocket event notifications rely on audit
# logs, so this should not be set lower than 600 (5 minutes).
max_audit_log_age: 1209600

...and to prevent surprises and avoid bad database behavior -- especially the first time the cleanup job runs on an existing cluster with a huge backlog -- a maximum number of rows to delete in a single transaction.

# Maximum number of log rows to delete in a single SQL transaction.
#
# If max_audit_log_delete_batch is 0, log entries will never be
# deleted by Arvados. Cleanup can be done by an external process
# without affecting any Arvados system processes, as long as very
# recent (<5 minutes old) logs are not deleted.
#
# 100000 is a reasonable batch size for most sites.
max_audit_log_delete_batch: 0

In the API server, periodically (every (max_audit_log_age/14) seconds) delete all log rows older than max_audit_log_age.

If max_audit_log_age or max_audit_log_delete_batch is zero, never delete old logs.
If max_audit_log_age is non-zero but max_audit_log_delete_batch is zero, which is the default configuration, warn at startup (and, as stated in the previous point, never delete old logs).
Only delete logs with event_type ∈ {create, update, destroy, delete} -- don't touch job/container stderr logs (they are handled by the existing "delete job/container logs" rake tasks)

This should be done in a background thread so it doesn't delay the current API response.

Similar to SweepTrashedCollections: add a hook after each write transaction (create/update/delete).
In a new thread, use the Rails cache to check whether we have already done a "delete old logs" operation in the last N seconds. If so, end the thread.
Create a file in Rails.tmp if it doesn't exist already, and use flock() with LOCK_NB to check whether another thread is already running. If flock() fails, end the thread.
While holding flock:
- Execute a DELETE statement with limit=max_audit_log_delete_batch, commit the transaction, and repeat until fewer than max_audit_log_delete_batch rows are deleted.
- Call sleep(max_audit_log_age/10000) between transactions to avoid keeping PostgreSQL too busy in scenarios where there are many rows to delete.
- After each transaction, log a notice showing how many rows were deleted.
- When the last transaction is done, update the "last cleanup" entry in the Rails cache.
Drop flock and end the thread.

Extra notes:

Considered adding max_log_row_count too, but that would be harder to implement (whereas max age is a simple SQL "delete" statement), it would be harder to choose a sensible value for a given site (the number of logs per second can vary widely), and it's much easier to reason about the functional effects of a time threshold.

Subtasks 1 (0 open — 1 closed)

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Tom Clegg about 8 years ago

Subject changed from [API] Delete rows from logs table when older than configured threshold to [API] Delete rows from logs table when they exceed a configured threshold
Description updated (diff)
Category set to API

Actions

Copy link

Updated by Tom Clegg about 8 years ago

TBD:

What should the default be? 2w?
Any special considerations for existing sites that want to keep logs forever, and don't check the docs before upgrading?
Any special considerations for the very first "delete old logs" query, which will probably be quite slow on existing sites?
~~Worth using a shorter interval (like threshold/100) or backgrounding the query, to avoid producing inexplicably slow API calls when they happen to land on a garbage-collection threshold?~~ (description updated to include this)

Actions

Copy link

Updated by Tom Clegg about 8 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Morris about 8 years ago

Target version set to Arvados Future Sprints

Actions

Copy link

Updated by Tom Morris about 8 years ago

Target version changed from Arvados Future Sprints to 2017-04-12 sprint
Story points changed from 0.5 to 2.0

Actions

Copy link

Updated by Tom Clegg about 8 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Morris about 8 years ago

Tracker changed from Bug to Idea

Actions

Copy link

Updated by Tom Morris about 8 years ago

Target version changed from 2017-04-12 sprint to 2017-03-29 sprint

Actions

Copy link

Updated by Tom Clegg about 8 years ago

Assigned To set to Tom Clegg

Actions

Copy link

#10

Updated by Tom Clegg about 8 years ago

Status changed from New to In Progress

Actions

Copy link

#11

Updated by Tom Clegg about 8 years ago

11065-rotate-logs @ 37e995fb7423b2b353c599e2a1b00bda7c29ee6f

Actions

Copy link

#12

Updated by Radhika Chippada about 8 years ago

“Time to keep audit logs (a row in the log table added … “ => “Time to keep audit logs (a row in the log table is added …”

Since we are using the default value from “common” in “test” config also, do we really need this at line 483: max_audit_log_delete_batch: 0

I couldn’t find where we are doing Rails.cache.write for ‘AuditLogs' and so I am not clear how this will work: Rails.cache.fetch('AuditLogs', expires_in: exp)

Do we need File.owned?(Rails.root.join('tmp')) here? Can it be sufficient if one has write permission and can create the lock file here instead?

This test name can probably clarified as “retain old audit logs with default setting of max_audit_log_delete_batch=0”

Test ‘retain old audit logs with default settings’ and 'retain old audit logs with batch=0' are basically doing the same, once we clarify the name of the first test …

tidy_in_background actually returns if max_audit_log_age configured as zero, but the test 'delete old audit logs in multiple batches' ignores such a restriction by passing it and indeed uses zero, which is confusing. Suggest using a value such as 0.1 instead.

The test name delete old audit logs with production settings' is a bit misleading because max_age < 5min is not production-like per the max age config parameter documentation and you might want to use 5min instead in this test.

Actions

Copy link

#13

Updated by Tom Clegg about 8 years ago

Radhika Chippada wrote:

“Time to keep audit logs (a row in the log table added … “ => “Time to keep audit logs (a row in the log table is added …”

Reworded.

Since we are using the default value from “common” in “test” config also, do we really need this at line 483: max_audit_log_delete_batch: 0

I guess not. I had added to ensure the test suite doesn't auto-delete even if we start enabling it by default. But maybe it's less confusing if we wait and only do that if/when we get there.

I couldn’t find where we are doing Rails.cache.write for ‘AuditLogs' and so I am not clear how this will work: Rails.cache.fetch('AuditLogs', expires_in: exp)

The block form of fetch() runs the block and updates the cache with the returned value if the cached value is expired/nonexistent.

Do we need File.owned?(Rails.root.join('tmp')) here? Can it be sufficient if one has write permission and can create the lock file here instead?

This addresses the situation where a rake task runs as root. root can write files to the cache directory, but if it does, the www-data won't be able to update/delete them later.

This test name can probably clarified as “retain old audit logs with default setting of max_audit_log_delete_batch=0”

Test ‘retain old audit logs with default settings’ and 'retain old audit logs with batch=0' are basically doing the same, once we clarify the name of the first test …

Added comments to clarify purpose of separate tests.

tidy_in_background actually returns if max_audit_log_age configured as zero, but the test 'delete old audit logs in multiple batches' ignores such a restriction by passing it and indeed uses zero, which is confusing. Suggest using a value such as 0.1 instead.

Updated.

The test name delete old audit logs with production settings' is a bit misleading because max_age < 5min is not production-like per the max age config parameter documentation and you might want to use 5min instead in this test.

I just didn't think it was worth changing the test fixtures over it... added a comment to explain the anomaly.

11065-rotate-logs @ 6722b420effcab24dffd9b47fa277f8830bd4cca

Actions

Copy link

#14

Updated by Radhika Chippada about 8 years ago

LGTM

Actions

Copy link

#15

Updated by Tom Clegg about 8 years ago

Status changed from In Progress to Resolved
% Done changed from 0 to 100

Applied in changeset arvados|commit:4600343d1bff7ac4f7b9f08486541444c31af8b6.

Actions

Copy link

#16

Updated by Tom Morris almost 6 years ago

Related to Bug #11509: [Keep-web] Support CORS requests with Range headers added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Idea #11065

[API] Delete rows from logs table when they exceed a configured threshold

Background¶

Problem to address here¶

Proposed fix¶

Updated by Tom Clegg about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Tom Morris about 8 years ago

Updated by Tom Morris about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Tom Morris about 8 years ago

Updated by Tom Morris about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Radhika Chippada about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Radhika Chippada about 8 years ago

Updated by Tom Clegg about 8 years ago

Updated by Tom Morris almost 6 years ago