Project

General

Profile

Actions

Bug #21748

open

awscli downloads from keep-web slowly?

Added by Peter Amstutz about 2 months ago. Updated 13 days ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
Keep
Story points:
-

Subtasks 1 (1 open0 closed)

Task #21756: ReviewNewLucas Di PentimaActions

Related issues

Related to Arvados - Bug #21901: Reduce redundant file_download logging when client uses multi-threaded/reassembly approachNewBrett SmithActions
Related to Arvados - Bug #21904: Remove feature that rejects POST arvados/v1/logs immediately when any requests are queuedNewActions
Related to Arvados - Bug #21907: Uncached permission lookup in keep-web when handling s3 requestNewBrett SmithActions
Actions #1

Updated by Peter Amstutz about 2 months ago

  • Assigned To set to Tom Clegg
Actions #2

Updated by Peter Amstutz about 1 month ago

  • Status changed from New to In Progress
Actions #3

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
Actions #4

Updated by Peter Amstutz 28 days ago

  • Target version changed from Development 2024-06-05 sprint to Development 2024-06-19 sprint
Actions #5

Updated by Peter Amstutz 19 days ago

A couple theories

  • inefficient manifest parsing (#21891) (but it looks like the delay is on keepstore)
  • logging that the user accessed the file on the 'logs' table
  • checking tokens (TBD)
Actions #6

Updated by Peter Amstutz 19 days ago

Findings:

  • Setting Collections.WebDAVLogEvents: false significantly improves performance
  • It appears that the POST requests for creating the "file_download" events are returning 503
  • Parallel downloads result in the creation of multiple "file_download" events, which are redundant for reporting purposes

Current hypothesis:

Despite the fact that the POST requests to create "file_download" are run in a goroutine (and thus shouldn't block the main request thread), receiving a 503 may be triggering global throttling in the Arvados client. However, keepstore doesn't use the Arvados client for block fetching, so we don't yet know what is causing the slowdown -- Tom suggested maybe it is making a request to the API server to verify that the collection is up-to-date.

Possible fixes:

  • We need to determine if it is subject to global throttling, and if so, what API call on the main request thread is actually blocking.
    • If it is checking that the collection is up to date on every request, it shouldn't do that, it should have a TTL
  • Arvados client could specifically identify "log create" events as something that should not trigger global throttling
  • It shouldn't be posting repeated "file_download" events, it should keep track of what events have been posted in the last N seconds and suppress redundant logging. (The "file_download" events are audit logs mainly intended to inform humans, so there is no reason to log every since access, just the fact that the user accessed it at all in a particular period, say 1 hour).

In addition:

  • Configuring the aws cli to download in 64 MB chunks instead of 8 MB chunks (default) significantly improves performance.

Current hypothesis:

aws cli sending several requests for 8 MB chunks but they end up mostly waiting on the same block, because keep-web to keepstore is requesting one 64 MB block.

However, keepstore to the S3 backend is also using a multi-chunk request, and starts sending data as soon as the first chunk is available. The current behavior is 11 parallel requests (each fetching about 6 MB of data) which was observed optimal behavior with actual Amazon S3, but other S3-like services may have different preferred behavior patterns. For example, with a different service it might be faster to do fewer or just a single request to get a 64 MB block.

We don't have an option to adjust the parallel request fetching from S3.

We can set an output buffer in keep-web, which enables keep-web to pull data a bit faster than the client receives it in order to hide request latency, but only for data the client has requested. The default is 0, we don't know if would help, we didn't try it.

Actions #7

Updated by Tom Clegg 19 days ago

The Go client "pause all requests after 503 error" behavior should not delay higher priority requests (like "get collection PDH") when only low-priority requests (like "write log entry") are returning 503.

Now that container logs do not use the POST /arvados/v1/logs endpoint, we should reconsider whether we still want controller to reject such requests with 503 immediately when the incoming request queue is non-empty. Perhaps it would be better to queue with normal priority -- that way, the "log upload/download" requests would have waited in the queue instead of giving up immediately. Otherwise, even if we fix the "log N downloads when client downloads a file in N parts" and "503 causes non-logging requests to be delayed" issues, the upload/download logging will effectively be suspended while controller is busy, which is probably not expected/desired.

We (and Roche) could check whether the "check current PDH" API call is the one causing the delays by doing trials with a PDH instead of a UUID.

Tangentially, we also noticed that keepstore logs do not include the supplied token UUID.

Actions #8

Updated by Peter Amstutz 18 days ago

From discussion at standup:

We want to do 3 things

  • need to confirm that the hypothesis that a log event 503 error is trigging global throttling, and identify exactly what operation is being throttled. Probably want to reproduce this by artificially making the log POST get a 503 error.
  • keep-web to track recent download requests by (file, API token, client provided as X-Real-IP or X-Forwarded-For) and implement a quiet period (e.g. 5 minutes) where it won't post additional redundant file_download events. Although, a large file download that takes more than 5 minutes would still post multiple download events, so a better strategy might be to have a shorter quiet period (1 minute) but reset the timeout each time so it still only posts one event.
  • Tom to write a ticket proposing changes to the controller behavior so we no longer aggressively reject log events when there's any other events queued, but instead just queue them at low priority
Actions #9

Updated by Tom Clegg 18 days ago

  • Related to Bug #21901: Reduce redundant file_download logging when client uses multi-threaded/reassembly approach added
Actions #10

Updated by Tom Clegg 15 days ago

  • Related to Bug #21904: Remove feature that rejects POST arvados/v1/logs immediately when any requests are queued added
Actions #11

Updated by Tom Clegg 15 days ago

This token check in checks3signature arvados:source:services/keep-web/s3.go (only in the s3 code path) could be the API call being delayed by the quiet-after-503 period.

        if len(key) == 27 && key[5:12] == "-gj3su-" {
                // Access key is the UUID of an Arvados token, secret                                                                                         
                // key is the secret part.                                                                                                                    
                ctx := arvados.ContextWithAuthorization(r.Context(), "Bearer "+h.Cluster.SystemRootToken)
                err = client.RequestAndDecodeContext(ctx, &aca, "GET", "arvados/v1/api_client_authorizations/"+key, nil, nil)
                secret = aca.APIToken
        } else {
                // Access key and secret key are both an entire                                                                                               
                // Arvados token or OIDC access token.                                                                                                        
                ctx := arvados.ContextWithAuthorization(r.Context(), "Bearer "+unescapeKey(key))
                err = client.RequestAndDecodeContext(ctx, &aca, "GET", "arvados/v1/api_client_authorizations/current", nil, nil)
                secret = key
        }

This uuid->secret lookup should be cached, respecting the Collections.WebDAVCache.TTL config.

Actions #12

Updated by Tom Clegg 15 days ago

  • Related to Bug #21907: Uncached permission lookup in keep-web when handling s3 request added
Actions #13

Updated by Peter Amstutz 13 days ago

  • Target version changed from Development 2024-06-19 sprint to Development 2024-07-03 sprint
Actions

Also available in: Atom PDF