Bug #21901
open
Reduce redundant file_download logging when client uses multi-threaded/reassembly approach
Added by Tom Clegg about 1 month ago.
Updated 7 days ago.
Release relationship: Auto
Description
Background
Currently, if a client (like aws s3 cp ...) launches multiple threads to download file segments concurrently and assemble them on the client side, and the WebDAVLogEvents config is enabled, each file segment generates a new entry in the logs table. This causes excessive load and misleading statistics when downloading, for example, a multi-gigabyte file in 8 MiB segments.
Proposed behavior
Keep-web should maintain an in-memory lookup table of requests that appear to be part of an ongoing multi-request download that has already been logged, and skip the file_download log for subsequent segments. Something like:
If the request has a Range header:
- Generate a download event key comprising the client IP address (X-Forwarded-For), token, collection ID, and filename
- If the key is already in the "ongoing download" table with a recent timestamp, and the requested range does not include the first byte of the file, just update the timestamp in the table and don't generate a file_download log entry
- Otherwise, add the key to the table and generate a file_download log entry
The definition of "recent" should be configurable, default 30 seconds. If configured to 0, this log consolidation behavior should be disabled.
- Related to Bug #21748: awscli downloads from keep-web slowly? added
- Target version set to Development 2024-07-03 sprint
- Assigned To set to Brett Smith
- Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
Some implementation thoughts/hints
- add a struct type (multipartRequestKey or logDedupKey?) with the fields that will be equal for all requests in a multipart download -- something like userUUID, collectionUUID, collectionPDH, filepath
- add a "last logged" lookup table and a mutex to protect concurrent accesses, to handler:
lastLogged map[multipartRequestKey]time.Time
lastLoggedMtx sync.Mutex
lastLoggedTidied time.Time
- add a *handler method (logIsRedundant?) that locks the mutex and looks up / updates the relevant entry in h.lastLogged, and periodically deletes old entries with something similar to the lockTidied code (the existing h.lock = map[string]... code is also a good example of a convenient place to initialize the empty map)
- near the top of logUploadOrDownload(), after the user UUID and collection UUID/PDH are known:
  if r.Method == "GET" && h.logIsRedundant(filepath, collection, user, time.Now()) { return }
  (I'm thinking adding an explicit time argument would make it easier to test without a lot of time.Sleep())
Note logUploadOrDownload is also responsible for logging "file download" messages to stderr even when WebDAVLogEvents is turned off. I'm assuming we want those to be deduplicated as well.