Bug #21901 (Closed)
Reduce redundant file_download logging when client uses multi-threaded/reassembly approach
Added by Tom Clegg 5 months ago. Updated 3 months ago.
Description
Background
Currently, if a client (like aws s3 cp ...) launches multiple threads to download file segments concurrently and assemble them on the client side, and the WebDAVLogEvents config is enabled, each file segment generates a new entry in the logs table. This causes excessive load and misleading statistics when downloading, for example, a multi-gigabyte file in 8 MiB segments.
Proposed behavior
Keep-web should maintain an in-memory lookup table of requests that appear to be part of an ongoing multi-request download that has already been logged, and skip the file_download log for subsequent segments. Something like:
- Generate a download event key comprising the client IP address (X-Forwarded-For), token, collection ID, and filename
- If the key is already in the "ongoing download" table with a recent timestamp, and the requested range does not include the first byte of the file, just update the timestamp in the table and don't generate a file_download log entry
- Otherwise, add the key to the table and generate a file_download log entry
The definition of "recent" should be configurable, default 30 seconds. If configured to 0, this log consolidation behavior should be disabled.
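A minimal sketch of the proposed rule, for illustration only (the names downloadKey and shouldLogDownload are made up here, not taken from keep-web):

package keepweb

import "time"

// downloadKey is the proposed "download event key": client IP (from
// X-Forwarded-For), token, collection ID, and filename.
type downloadKey struct {
	clientIP   string
	token      string
	collection string
	filename   string
}

// shouldLogDownload decides whether to write a file_download log entry.
// lastSeen holds the last time each key was seen; interval is the
// configurable "recent" window (default 30s; 0 disables consolidation).
func shouldLogDownload(lastSeen map[downloadKey]time.Time, key downloadKey, includesFirstByte bool, now time.Time, interval time.Duration) bool {
	if interval == 0 {
		return true // consolidation disabled by configuration
	}
	last, ok := lastSeen[key]
	lastSeen[key] = now // update the timestamp either way
	if ok && now.Sub(last) < interval && !includesFirstByte {
		return false // looks like another segment of an already-logged download
	}
	return true
}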
Related issues
Updated by Tom Clegg 5 months ago
- Related to Bug #21748: awscli downloads from keep-web slowly added
Updated by Peter Amstutz 5 months ago
- Target version set to Development 2024-07-03 sprint
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
Updated by Tom Clegg 4 months ago
- add a struct type (multipartRequestKey or logDedupKey?) with the fields that will be equal for all requests in a multipart download -- something like userUUID, collectionUUID, collectionPDH, filepath
- add a "last logged" lookup table and a mutex to protect concurrent accesses, to handler:
  lastLogged map[multipartRequestKey]time.Time
  lastLoggedMtx sync.Mutex
  lastLoggedTidied time.Time
- add a *handler method (logIsRedundant?) that locks the mutex and looks up / updates the relevant entry in h.lastLogged, and periodically deletes old entries with something similar to the lockTidied code (the existing h.lock = map[string]... code is also a good example of a convenient place to initialize the empty map); see the sketch below
- near the top of logUploadOrDownload(), after user UUID and collection UUID/PDH are known:
  if r.Method == "GET" && h.logIsRedundant(filepath, collection, user, time.Now()) { return }
  (I'm thinking adding an explicit time argument would make it easier to test without a lot of time.Sleep())
Note logUploadOrDownload is also responsible for logging "file download" messages to stderr even when WebDAVLogEvents is turned off. I'm assuming we want those to be deduplicated as well.
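A sketch of what the *handler method above could look like, using the field names from this note; the tidying condition, the interval plumbing, and the use of a single key struct parameter are guesses for illustration, not the actual implementation:

package keepweb

import (
	"sync"
	"time"
)

type multipartRequestKey struct {
	userUUID       string
	collectionUUID string
	collectionPDH  string
	filepath       string
}

type handler struct {
	interval         time.Duration // the configured dedup window; 0 disables
	lastLogged       map[multipartRequestKey]time.Time
	lastLoggedMtx    sync.Mutex
	lastLoggedTidied time.Time
}

// logIsRedundant returns true if an equivalent download was logged within
// the configured interval, updating the table either way. Passing "now"
// explicitly keeps tests free of time.Sleep().
func (h *handler) logIsRedundant(key multipartRequestKey, now time.Time) bool {
	if h.interval == 0 {
		return false
	}
	h.lastLoggedMtx.Lock()
	defer h.lastLoggedMtx.Unlock()
	if h.lastLogged == nil {
		h.lastLogged = map[multipartRequestKey]time.Time{}
	}
	// Periodically drop stale entries so the map doesn't grow forever.
	if now.Sub(h.lastLoggedTidied) > h.interval {
		for k, t := range h.lastLogged {
			if now.Sub(t) > h.interval {
				delete(h.lastLogged, k)
			}
		}
		h.lastLoggedTidied = now
	}
	last, seen := h.lastLogged[key]
	h.lastLogged[key] = now
	return seen && now.Sub(last) < h.interval
}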
Updated by Peter Amstutz 4 months ago
- Target version changed from Development 2024-07-24 sprint to Development 2024-08-07 sprint
Updated by Brett Smith 3 months ago
One consequence of the design as written is: imagine a scenario where multiple users behind NAT download the same file anonymously. Because they look identical to keep-web (using the same client API token and IP address), logs for these download events will be squashed, even though they are in fact multiple downloads.
I think this problem is difficult to solve in the general case, but do we want to implement some sort of heuristic to prevent it? Maybe re-log if the same client requests the same range within the time interval? That's not foolproof, but it would cover cases like "two users behind NAT download the same file at the same time with a same-enough version of aws s3 cp."
Updated by Brett Smith 3 months ago
21901-file-log-throttling @ 4fa001b0b9af167667130b2336aca1c2bce15a74 - developer-run-tests: #4370
The existing code already had two parallel code paths for logging: one for local logging, and another for Arvados API logging. I tried to clean this up by having the key struct store all the data we want to log, and then export that for local logging and/or Arvados API logging as desired. This both reduces repetition and means that if we want to log more details in the future, those details will, by default, feed into the log deduplication functionality.
- All agreed upon points are implemented / addressed.
  - Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
  - Added lots of tests -- see results above
- Documentation has been updated.
  - Yes
- Behaves appropriately at the intended scale (describe intended scale).
  - Improves scale by reducing logging overhead
- Considered backwards and forwards compatibility issues between client and server.
  - No change in API compatibility
- Follows our coding standards and GUI style guidelines.
  - Yes
Updated by Peter Amstutz 3 months ago
- Target version changed from Development 2024-08-07 sprint to Development 2024-08-28 sprint
Updated by Tom Clegg 3 months ago
Brett Smith wrote in #note-10:
One consequence of the design as written is: imagine a scenario where multiple users behind NAT download the same file anonymously. Because they look identical to keep-web (using the same client API token and IP address), logs for these download events will be squashed, even though they are in fact multiple downloads.
The proposed "...and the requested range does not include the first byte of the file..." condition would cover the most obvious/common cases of that, and would also avoid dropping logs when a file is downloaded multiple times in a row by the same user. I'm guessing you didn't see that bit.
21901-file-log-throttling @ 4fa001b0b9af167667130b2336aca1c2bce15a74
TestLogXForwardedFor should test some values with host:port and [host]:port.
According to https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For:
- we should expect multiple X-Forwarded-For headers
- the client of the proxy closest to the server (which is the one whose IP we want to log, unless we recognize it as a trusted proxy itself, which would require additional configuration which we probably don't want to explore right now) is listed last, not first
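A rough illustration of the behavior described in that second point -- taking the rightmost entry across possibly multiple X-Forwarded-For headers; this is only a sketch, not the keep-web code:

package keepweb

import (
	"net/http"
	"strings"
)

// clientIPFromXFF returns the rightmost X-Forwarded-For entry, i.e. the
// client as seen by the proxy closest to the server. Multiple headers are
// treated the same as one comma-separated header.
func clientIPFromXFF(h http.Header) string {
	var entries []string
	for _, v := range h.Values("X-Forwarded-For") {
		for _, e := range strings.Split(v, ",") {
			if e = strings.TrimSpace(e); e != "" {
				entries = append(entries, e)
			}
		}
	}
	if len(entries) == 0 {
		return ""
	}
	return entries[len(entries)-1]
}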
Rest LGTM, thanks.
Updated by Brett Smith 3 months ago
Tom Clegg wrote in #note-13:
The proposed "...and the requested range does not include the first byte of the file..." condition would cover the most obvious/common cases of that, and would also avoid dropping logs when a file is downloaded multiple times in a row by the same user. I'm guessing you didn't see that bit.
I did see it; my issue was more that it wasn't clear whether #note-7 was intended to completely supersede the ticket description or not (e.g., they suggest keying on different sets of values). And, at the time, I didn't read the sequence carefully enough: I thought requests past the first byte would never get logged, and that seemed worse.
Going over it again, I see I misunderstood. At the same time, even with this version, I'll point out that a variant of the situation I was worried about is still possible: someone behind the NAT just needs to send their requests with Range: bytes=1-, and they'll get basically the whole file without being logged.
Of course, at this point we are well into "someone malicious is actively trying to avoid detection" territory. But we do document WebDAV logging generally as an auditability option, so I do think it's worth at least thinking about this.
If we wanted to completely address this, I think what we would have to do is:
- the values in the cache map are a struct that includes a timestamp and a slice of range value pairs
- a download request gets logged, and a new cache value created, if any of the following are true:
- the key is not in the cache map
- the timestamp value in the cache map is past the configured time interval
- there is any overlap between the range of this request and the range value pairs in the slice
- otherwise, the cache map value gets updated by appending this request's range to the slice
Is this reasonable? Worth doing?
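A sketch of what that overlap-tracking cache could look like (purely illustrative, with made-up names; this proposal was not ultimately adopted):

package keepweb

import "time"

// byteRange is a half-open [start, end) span of the file.
type byteRange struct{ start, end int64 }

func (a byteRange) overlaps(b byteRange) bool {
	return a.start < b.end && b.start < a.end
}

// cacheEntry is the per-key value in the dedup map under this proposal.
type cacheEntry struct {
	loggedAt time.Time
	ranges   []byteRange
}

// shouldLog applies the rules from the list above: log (and start a fresh
// entry) if the key is new, the entry is stale, or the requested range
// overlaps anything already served; otherwise just record the new range.
func shouldLog(cache map[string]*cacheEntry, key string, r byteRange, now time.Time, interval time.Duration) bool {
	e, ok := cache[key]
	if !ok || now.Sub(e.loggedAt) > interval || anyOverlap(e.ranges, r) {
		cache[key] = &cacheEntry{loggedAt: now, ranges: []byteRange{r}}
		return true
	}
	e.ranges = append(e.ranges, r)
	return false
}

func anyOverlap(rs []byteRange, r byteRange) bool {
	for _, x := range rs {
		if x.overlaps(r) {
			return true
		}
	}
	return false
}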
Updated by Peter Amstutz 3 months ago
Brett Smith wrote in #note-14:
- there is any overlap between the range of this request and the range value pairs in the slice
- otherwise, the cache map value gets updated by appending this request's range to the slice
Wouldn't the second download then trigger a log on every range request, because they had already been requested?
I think Tom's original heuristic of "record log when the client is new, or the request includes byte 0" is reasonable.
Honestly, I'd be fine if it didn't take range into account at all, and it only chooses to create a log depending on whether there's been a request from the same client for the same file in the last N seconds. From an auditing standpoint, we want to know the file access happened at all; exactly how many times it happened is less important.
(From a metrics standpoint it would be nice to have more precise numbers to measure demand for a particular file, but that isn't the goal of this ticket).
Updated by Brett Smith 3 months ago
Peter Amstutz wrote in #note-15:
Wouldn't the second download then trigger a log on every range request, because they had already been requested?
When I talk about ranges overlapping, what I mean is like:
Imagine the same client requests bytes 0-9, 20-29, and 10-19. None of these ranges overlap, so only the first request gets logged.
Now imagine instead that the same client requests bytes 0-9, 20-29, and 5-15. The first request would be logged. The second request would not, because there's no range overlap. The third request would get logged, and create a new cache entry, because there's overlap between the previous range 0-9 and the request's range 5-15.
From an auditing standpoint, we want to know the file access happened at all; exactly how many times it happened is less important.
This is a totally reasonable take and I don't have any problem with it as a logging philosophy; I just wonder whether our users will agree this meets their logging needs. Which is why I'm sensitive to the fact that we specifically document this as an auditing log. If it was just a general informational log I'd care less. But we specifically present these logs as part of a file access auditing solution, and security departments might have pretty strict expectations of what that means.
I guess we can just cover this in documentation of the option? Be very explicit that multiple requests within WebDAVLogDownloadInterval might not be logged, and therefore set it accordingly? I'd be fine with the byte 0 heuristic + that punt.
Updated by Tom Clegg 3 months ago
To be clear, under the current proposal a download event goes unlogged only when both of the following hold:
- it's not a complete download -- perhaps the client skips the first byte specifically to reduce the probability of getting logged? -- and
- we already logged an identical download event <30 seconds ago when an indistinguishable client at the same IP address (or behind the same NAT gateway) downloaded the same file using the same token (or both accesses were anonymous/unauthenticated).
I guess we could also note that even with the deduplication feature disabled entirely, we still can't guarantee logs for multiple identical downloads from behind someone else's network device, because caching proxy servers exist.
Updated by Tom Clegg 3 months ago
Brett Smith wrote in #note-14:
If we wanted to completely address this, I think what we would have to do is:
...
- there is any overlap between the range of this request and the range value pairs in the slice
...Is this reasonable? Worth doing?
I don't think it's worth tracking all of the ranges that have been downloaded in order to guarantee a log when any byte is downloaded twice. I think the "always log if byte 0 is downloaded" rule is enough.
Maybe we need to clarify in the docs that this is not a suitable tool for detecting unauthorized access, but rather a tool to track normal access by authorized users.
Updated by Brett Smith 3 months ago
Background on Range header parsing: in Go's http package this is handled in fs.go by the parseRange function. This is called from inside ServeContent, and neither the function nor any of its results are exported anywhere we can see them.
The third-party http_range package is literally just a copy+paste of the stdlib code with public exports. On the one hand, that feels silly, but on the other hand, I would like to stay as close as possible to that code to make sure we have consistency between our logging logic and our serving logic. So, I guess I'll just use that rather than copy+pasting it ourselves.
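For context, a sketch of how a byte-0 check could look with such a package. The import path (github.com/gotd/contrib/http_range) and the ParseRange signature shown here are assumptions based on the stdlib code it mirrors, not necessarily what the branch actually uses:

package keepweb

import (
	"net/http"

	// Assumed import path for the "http_range" package discussed above;
	// it re-exports the stdlib's parseRange logic.
	"github.com/gotd/contrib/http_range"
)

// requestIncludesFirstByte reports whether the request's Range header
// (if any) covers byte 0 of a file of the given size. A request with no
// Range header is a whole-file download and trivially includes byte 0.
func requestIncludesFirstByte(r *http.Request, size int64) bool {
	hdr := r.Header.Get("Range")
	if hdr == "" {
		return true
	}
	ranges, err := http_range.ParseRange(hdr, size)
	if err != nil {
		// Unparseable or unsatisfiable range: err on the side of logging.
		return true
	}
	for _, rng := range ranges {
		if rng.Start == 0 {
			return true
		}
	}
	return false
}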
Updated by Brett Smith 3 months ago
21901-file-log-throttling @ 5b1f5fec91615f987b6451c128f26a26f41d6644 - developer-run-tests: #4402
Implementing the "does the request get byte 0" heuristic requires parsing the Range header. In order to avoid inconsistent logic within keep-web, I wanted to use the same parser that Go's HTTP server itself uses. This required pulling in a new module; see #note-19.
In addition, the parsing code wants to know the size of the file being requested, because it standardizes the requested ranges to absolute byte positions. This means, assuming I stick to the above, I can't do any Range header parsing if the file doesn't exist or I otherwise can't determine its size. I erred on the side of logging in this case. (A koan: is it meaningful to say a client requested byte 0 of a file that doesn't exist?)
I also noticed that even in the "normal" case of aws s3 cp blasting a bunch of non-overlapping partial content requests in parallel, there's a decent chance this logic means one logical "request" will be logged twice: once for whatever GET comes in first, and again for whatever GET requests byte 0, if they happen not to be the same request. I don't see any way to prevent that without implementing the level of tracking I proposed earlier, so I assume we'll just live with this, even though it could come off as weirdly inconsistent to administrators.
TestLogXForwardedFor should test some values with host:port and [host]:port.
The MDN documentation you linked never says that the port can appear in a valid value. It doesn't even mention the word "port." So I don't see any reason to do this.
we should expect multiple X-Forwarded-For headers
I believe this is covered by TestLogXForwardedForMultivalue.
Updated by Tom Clegg 3 months ago
Brett Smith wrote in #note-20:
I also noticed that even in the "normal" case of aws s3 cp blasting a bunch of non-overlapping partial content requests in parallel, there's a decent chance this logic means one logical "request" will be logged twice: once for whatever GET comes in first, and again for whatever GET requests byte 0, if they happen not to be the same request. I don't see any way to prevent that without implementing the level of tracking I proposed earlier, so I assume we'll just live with this, even though it could come off as weirdly inconsistent to administrators.
Yes, I think that's fine. "Sometimes 1 redundant log in same interval" isn't perfect, but it's better than "usually lots of redundant logs". And anything parsing the logs needs to accommodate dups anyway, because some multi-request downloads take longer than WebDAVLogDownloadInterval.
TestLogXForwardedFor should test some values with host:port and [host]:port.
The MDN documentation you linked never says that the port can appear in a valid value. It doesn't even mention the word "port." So I don't see any reason to do this.
Sorry, I misread the use of SplitHostPort -- we only use that on the req RemoteAddr, not the X-Forwarded-For header values.
(The Forwarded header, which I guess we'll convert to someday, does sometimes include ports.)
we should expect multiple X-Forwarded-For headers
I believe this is covered by TestLogXForwardedForMultivalue.
Ah, yes.
LGTM, thanks.
Updated by Brett Smith 3 months ago
- Status changed from In Progress to Resolved