Bug #21901 (Closed)
Reduce redundant file_download logging when client uses multi-threaded/reassembly approach
Added by Tom Clegg 5 months ago. Updated 3 months ago.
Description
Background
Currently, if a client (like aws s3 cp ...) launches multiple threads to download file segments concurrently and assemble them on the client side, and the WebDAVLogEvents config is enabled, each file segment generates a new entry in the logs table. This causes excessive load and misleading statistics when downloading, for example, a multi-gigabyte file in 8 MiB segments.
Proposed behavior
Keep-web should maintain an in-memory lookup table of requests that appear to be part of an ongoing multi-request download that has already been logged, and skip the file_download log for subsequent segments. Something like:
- Generate a download event key comprising the client IP address (X-Forwarded-For), token, collection ID, and filename
- If the key is already in the "ongoing download" table with a recent timestamp, and the requested range does not include the first byte of the file, just update the timestamp in the table and don't generate a file_download log entry
- Otherwise, add the key to the table and generate a file_download log entry
The definition of "recent" should be configurable, default 30 seconds. If configured to 0, this log consolidation behavior should be disabled.
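A minimal sketch of the proposed rule, for illustration only (the names downloadKey and shouldLogDownload are made up here, not taken from keep-web):

package keepweb

import "time"

// downloadKey is the proposed "download event key": client IP (from
// X-Forwarded-For), token, collection ID, and filename.
type downloadKey struct {
	clientIP   string
	token      string
	collection string
	filename   string
}

// shouldLogDownload decides whether to write a file_download log entry.
// lastSeen holds the last time each key was seen; interval is the
// configurable "recent" window (default 30s; 0 disables consolidation).
func shouldLogDownload(lastSeen map[downloadKey]time.Time, key downloadKey, includesFirstByte bool, now time.Time, interval time.Duration) bool {
	if interval == 0 {
		return true // consolidation disabled by configuration
	}
	last, ok := lastSeen[key]
	lastSeen[key] = now // update the timestamp either way
	if ok && now.Sub(last) < interval && !includesFirstByte {
		return false // looks like another segment of an already-logged download
	}
	return true
}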
Related issues
Updated by Tom Clegg 5 months ago
- Related to Bug #21748: awscli downloads from keep-web slowly added
Updated by Peter Amstutz 5 months ago
- Target version set to Development 2024-07-03 sprint
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
Updated by Tom Clegg 4 months ago
- add a struct type (multipartRequestKey or logDedupKey?) with the fields that will be equal for all requests in a multipart download -- something like userUUID, collectionUUID, collectionPDH, filepath
- add a "last logged" lookup table and a mutex to protect concurrent accesses, to handler:
  lastLogged map[multipartRequestKey]time.Time
  lastLoggedMtx sync.Mutex
  lastLoggedTidied time.Time
- add a *handler method (logIsRedundant?) that locks the mutex and looks up / updates the relevant entry in h.lastLogged, and periodically deletes old entries with something similar to the lockTidied code (the existing h.lock = map[string]... code is also a good example of a convenient place to initialize the empty map); see the sketch below
- near the top of logUploadOrDownload(), after user UUID and collection UUID/PDH are known:
  if r.Method == "GET" && h.logIsRedundant(filepath, collection, user, time.Now()) { return }
  (I'm thinking adding an explicit time argument would make it easier to test without a lot of time.Sleep())
Note logUploadOrDownload is also responsible for logging "file download" messages to stderr even when WebDAVLogEvents is turned off. I'm assuming we want those to be deduplicated as well.
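A sketch of what the *handler method above could look like, using the field names from this note; the tidying condition, the interval plumbing, and the use of a single key struct parameter are guesses for illustration, not the actual implementation:

package keepweb

import (
	"sync"
	"time"
)

type multipartRequestKey struct {
	userUUID       string
	collectionUUID string
	collectionPDH  string
	filepath       string
}

type handler struct {
	interval         time.Duration // the configured dedup window; 0 disables
	lastLogged       map[multipartRequestKey]time.Time
	lastLoggedMtx    sync.Mutex
	lastLoggedTidied time.Time
}

// logIsRedundant returns true if an equivalent download was logged within
// the configured interval, updating the table either way. Passing "now"
// explicitly keeps tests free of time.Sleep().
func (h *handler) logIsRedundant(key multipartRequestKey, now time.Time) bool {
	if h.interval == 0 {
		return false
	}
	h.lastLoggedMtx.Lock()
	defer h.lastLoggedMtx.Unlock()
	if h.lastLogged == nil {
		h.lastLogged = map[multipartRequestKey]time.Time{}
	}
	// Periodically drop stale entries so the map doesn't grow forever.
	if now.Sub(h.lastLoggedTidied) > h.interval {
		for k, t := range h.lastLogged {
			if now.Sub(t) > h.interval {
				delete(h.lastLogged, k)
			}
		}
		h.lastLoggedTidied = now
	}
	last, seen := h.lastLogged[key]
	h.lastLogged[key] = now
	return seen && now.Sub(last) < h.interval
}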
Updated by Peter Amstutz 4 months ago
- Target version changed from Development 2024-07-24 sprint to Development 2024-08-07 sprint
Updated by Brett Smith 3 months ago
One consequence of the design as written is: imagine a scenario where multiple users behind NAT download the same file anonymously. Because they look identical to keep-web (using the same client API token and IP address), logs for these download events will be squashed, even though they are in fact multiple downloads.
I think this problem is difficult to solve in the general case, but do we want to implement some sort of heuristic to prevent it? Maybe re-log if the same client requests the same range within the time interval? That's not foolproof, but it would cover cases like "two users behind NAT download the same file at the same time with a same-enough version of aws s3 cp."
Updated by Brett Smith 3 months ago
21901-file-log-throttling @ 4fa001b0b9af167667130b2336aca1c2bce15a74 - developer-run-tests: #4370
The existing code already had two parallel code paths for logging: one for local logging, and another for Arvados API logging. I tried to clean this up by having the key struct store all the data we want to log, and then export that for local logging and/or Arvados API logging as desired. This both reduces repetition and means that if we want to log more details in the future, those details will, by default, feed into the log deduplication functionality.
- All agreed upon points are implemented / addressed.
  - Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
  - Added lots of tests -- see results above
- Documentation has been updated.
  - Yes
- Behaves appropriately at the intended scale (describe intended scale).
  - Improves scale by reducing logging overhead
- Considered backwards and forwards compatibility issues between client and server.
  - No change in API compatibility
- Follows our coding standards and GUI style guidelines.
  - Yes
Updated by Peter Amstutz 3 months ago
- Target version changed from Development 2024-08-07 sprint to Development 2024-08-28 sprint
Updated by Tom Clegg 3 months ago
Brett Smith wrote in #note-10:
One consequence of the design as written is: imagine a scenario where multiple users behind NAT download the same file anonymously. Because they look identical to keep-web (using the same client API token and IP address), logs for these download events will be squashed, even though they are in fact multiple downloads.
The proposed "...and the requested range does not include the first byte of the file..." condition would cover the most obvious/common cases of that, and would also avoid dropping logs when a file is downloaded multiple times in a row by the same user. I'm guessing you didn't see that bit.
21901-file-log-throttling @ 4fa001b0b9af167667130b2336aca1c2bce15a74
TestLogXForwardedFor should test some values with host:port and [host]:port.
According to https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For:
- we should expect multiple X-Forwarded-For headers
- the client of the proxy closest to the server (which is the one whose IP we want to log, unless we recognize it as a trusted proxy itself, which would require additional configuration which we probably don't want to explore right now) is listed last, not first
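A rough illustration of the behavior described in that second point -- taking the rightmost entry across possibly multiple X-Forwarded-For headers; this is only a sketch, not the keep-web code:

package keepweb

import (
	"net/http"
	"strings"
)

// clientIPFromXFF returns the rightmost X-Forwarded-For entry, i.e. the
// client as seen by the proxy closest to the server. Multiple headers are
// treated the same as one comma-separated header.
func clientIPFromXFF(h http.Header) string {
	var entries []string
	for _, v := range h.Values("X-Forwarded-For") {
		for _, e := range strings.Split(v, ",") {
			if e = strings.TrimSpace(e); e != "" {
				entries = append(entries, e)
			}
		}
	}
	if len(entries) == 0 {
		return ""
	}
	return entries[len(entries)-1]
}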
Rest LGTM, thanks.
Updated by Brett Smith 3 months ago
Tom Clegg wrote in #note-13:
The proposed "...and the requested range does not include the first byte of the file..." condition would cover the most obvious/common cases of that, and would also avoid dropping logs when a file is downloaded multiple times in a row by the same user. I'm guessing you didn't see that bit.
I did see it; my issue was more that it wasn't clear whether #note-7 was intended to completely supersede the ticket description or not (e.g., they suggest keying on different sets of values). And, at the time, I didn't read the sequence carefully enough: I thought requests past the first byte would never get logged, and that seemed worse.
Going over it again, I see I misunderstood. At the same time, even with this version, I'll point out that a variant of the situation I was worried about is still possible: someone behind the NAT just needs to send their requests with Range: bytes=1-, and they'll get basically the whole file without being logged.
Of course, at this point we are well into "someone malicious is actively trying to avoid detection" territory. But we do document WebDAV logging generally as an auditability option, so I do think it's worth at least thinking about this.
If we wanted to completely address this, I think what we would have to do is:
- the values in the cache map are a struct that includes a timestamp and a slice of range value pairs
- a download request gets logged, and a new cache value created, if any of the following are true:
- the key is not in the cache map
- the timestamp value in the cache map is past the configured time interval
- there is any overlap between the range of this request and the range value pairs in the slice
- otherwise, the cache map value gets updated by appending this request's range to the slice
Is this reasonable? Worth doing?
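A sketch of what that overlap-tracking cache could look like (purely illustrative, with made-up names; this proposal was not ultimately adopted):

package keepweb

import "time"

// byteRange is a half-open [start, end) span of the file.
type byteRange struct{ start, end int64 }

func (a byteRange) overlaps(b byteRange) bool {
	return a.start < b.end && b.start < a.end
}

// cacheEntry is the per-key value in the dedup map under this proposal.
type cacheEntry struct {
	loggedAt time.Time
	ranges   []byteRange
}

// shouldLog applies the rules from the list above: log (and start a fresh
// entry) if the key is new, the entry is stale, or the requested range
// overlaps anything already served; otherwise just record the new range.
func shouldLog(cache map[string]*cacheEntry, key string, r byteRange, now time.Time, interval time.Duration) bool {
	e, ok := cache[key]
	if !ok || now.Sub(e.loggedAt) > interval || anyOverlap(e.ranges, r) {
		cache[key] = &cacheEntry{loggedAt: now, ranges: []byteRange{r}}
		return true
	}
	e.ranges = append(e.ranges, r)
	return false
}

func anyOverlap(rs []byteRange, r byteRange) bool {
	for _, x := range rs {
		if x.overlaps(r) {
			return true
		}
	}
	return false
}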
Updated by Peter Amstutz 3 months ago
Brett Smith wrote in #note-14:
- there is any overlap between the range of this request and the range value pairs in the slice
- otherwise, the cache map value gets updated by appending this request's range to the slice
Wouldn't the second download then trigger a log on every range request, because they had already been requested?
I think Tom's original heuristic of "record log when the client is new, or the request includes byte 0" is reasonable.
Honestly, I'd be fine if it didn't take range into account at all, and it only chooses to create a log depending on whether there's been a request from the same client for the same file in the last N seconds. From an auditing standpoint, we want to know the file access happened at all; exactly how many times it happened is less important.
(From a metrics standpoint it would be nice to have more precise numbers to measure demand for a particular file, but that isn't the goal of this ticket).
Updated by Brett Smith 3 months ago
Peter Amstutz wrote in #note-15:
Wouldn't the second download then trigger a log on every range request, because they had already been requested?
When I talk about ranges overlapping, what I mean is like:
Imagine the same client requests bytes 0-9, 20-29, and 10-19. None of these ranges overlap, so only the first request gets logged.
Now imagine instead that the same client requests bytes 0-9, 20-29, and 5-15. The first request would be logged. The second request would not, because there's no range overlap. The third request would get logged, and create a new cache entry, because there's overlap between the previous range 0-9 and the request's range 5-15.
From an auditing standpoint, we want to know the file access happened at all; exactly how many times it happened is less important.
This is a totally reasonable take and I don't have any problem with it as a logging philosophy; I just wonder whether our users will agree this meets their logging needs. Which is why I'm sensitive to the fact that we specifically document this as an auditing log. If it was just a general informational log I'd care less. But we specifically present these logs as part of a file access auditing solution, and security departments might have pretty strict expectations of what that means.
I guess we can just cover this in documentation of the option? Be very explicit that multiple requests within WebDAVLogDownloadInterval might not be logged, and therefore set it accordingly? I'd be fine with the byte 0 heuristic + that punt.
Updated by Tom Clegg 3 months ago
To be clear, under the current proposal a download event goes unlogged only when both of the following hold:
- it's not a complete download -- perhaps the client skips the first byte specifically to reduce the probability of getting logged? -- and
- we already logged an identical download event <30 seconds ago when an indistinguishable client at the same IP address (or behind the same NAT gateway) downloaded the same file using the same token (or both accesses were anonymous/unauthenticated).
I guess we could also note that even with the deduplication feature disabled entirely, we still can't guarantee logs for multiple identical downloads from behind someone else's network device, because caching proxy servers exist.
Updated by Tom Clegg 3 months ago
Brett Smith wrote in #note-14:
If we wanted to completely address this, I think what we would have to do is:
...
- there is any overlap between the range of this request and the range value pairs in the slice
...Is this reasonable? Worth doing?
I don't think it's worth tracking all of the ranges that have been downloaded in order to guarantee a log when any byte is downloaded twice. I think the "always log if byte 0 is downloaded" rule is enough.
Maybe we need to clarify in the docs that this is not a suitable tool for detecting unauthorized access, but rather a tool to track normal access by authorized users.
Updated by Brett Smith 3 months ago
Background on Range header parsing: in Go's http package this is handled in fs.go by the parseRange function. This is called from inside ServeContent, and neither the function nor any of its results are exported anywhere we can see them.
The third-party http_range package is literally just a copy+paste of the stdlib code with public exports. On the one hand, that feels silly, but on the other hand, I would like to stay as close as possible to that code to make sure we have consistency between our logging logic and our serving logic. So, I guess I'll just use that rather than copy+pasting it ourselves.
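For context, a sketch of how a byte-0 check could look with such a package. The import path (github.com/gotd/contrib/http_range) and the ParseRange signature shown here are assumptions based on the stdlib code it mirrors, not necessarily what the branch actually uses:

package keepweb

import (
	"net/http"

	// Assumed import path for the "http_range" package discussed above;
	// it re-exports the stdlib's parseRange logic.
	"github.com/gotd/contrib/http_range"
)

// requestIncludesFirstByte reports whether the request's Range header
// (if any) covers byte 0 of a file of the given size. A request with no
// Range header is a whole-file download and trivially includes byte 0.
func requestIncludesFirstByte(r *http.Request, size int64) bool {
	hdr := r.Header.Get("Range")
	if hdr == "" {
		return true
	}
	ranges, err := http_range.ParseRange(hdr, size)
	if err != nil {
		// Unparseable or unsatisfiable range: err on the side of logging.
		return true
	}
	for _, rng := range ranges {
		if rng.Start == 0 {
			return true
		}
	}
	return false
}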
Updated by Brett Smith 3 months ago
21901-file-log-throttling @ 5b1f5fec91615f987b6451c128f26a26f41d6644 - developer-run-tests: #4402
Implementing the "does the request get byte 0" heuristic requires parsing the Range header. In order to avoid inconsistent logic within keep-web, I wanted to use the same parser that Go's HTTP server itself uses. This required pulling in a new module; see #note-19.
In addition, the parsing code wants to know the size of the file being requested, because it standardizes the requested ranges to absolute byte positions. This means, assuming I stick to the above, I can't do any Range header parsing if the file doesn't exist or I otherwise can't determine its size. I erred on the side of logging in this case. (A koan: is it meaningful to say a client requested byte 0 of a file that doesn't exist?)
I also noticed that even in the "normal" case of aws s3 cp blasting a bunch of non-overlapping partial content requests in parallel, there's a decent chance this logic means one logical "request" will be logged twice: once for whatever GET comes in first, and again for whatever GET requests byte 0, if they happen not to be the same request. I don't see any way to prevent that without implementing the level of tracking I proposed earlier, so I assume we'll just live with this, even though it could come off as weirdly inconsistent to administrators.
TestLogXForwardedFor should test some values with host:port and [host]:port.
The MDN documentation you linked never says that the port can appear in a valid value. It doesn't even mention the word "port." So I don't see any reason to do this.
we should expect multiple X-Forwarded-For headers
I believe this is covered by TestLogXForwardedForMultivalue.
Updated by Tom Clegg 3 months ago
Brett Smith wrote in #note-20:
I also noticed that even in the "normal" case of aws s3 cp blasting a bunch of non-overlapping partial content requests in parallel, there's a decent chance this logic means one logical "request" will be logged twice: once for whatever GET comes in first, and again for whatever GET requests byte 0, if they happen not to be the same request. I don't see any way to prevent that without implementing the level of tracking I proposed earlier, so I assume we'll just live with this, even though it could come off as weirdly inconsistent to administrators.
Yes, I think that's fine. "Sometimes 1 redundant log in same interval" isn't perfect, but it's better than "usually lots of redundant logs". And anything parsing the logs needs to accommodate dups anyway, because some multi-request downloads take longer than WebDAVLogDownloadInterval.
TestLogXForwardedFor should test some values with host:port and [host]:port.
The MDN documentation you linked never says that the port can appear in a valid value. It doesn't even mention the word "port." So I don't see any reason to do this.
Sorry, I misread the use of SplitHostPort -- we only use that on the req RemoteAddr, not the X-Forwarded-For header values.
(The Forwarded header, which I guess we'll convert to someday, does sometimes include ports.)
we should expect multiple X-Forwarded-For headers
I believe this is covered by TestLogXForwardedForMultivalue.
Ah, yes.
LGTM, thanks.
Updated by Brett Smith 3 months ago
- Status changed from In Progress to Resolved