Project

General

Profile

Actions

Bug #20200

closed

controller limits log create requests

Added by Peter Amstutz about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
-
Release relationship:
Auto

Description

Explicitly limit log create requests in controller so they do not overwhelm the API server.

When there are many containers running, they can overwhelm the API server sending live log updates. It is an acceptable tradeoff to discard logs in order to maintain system stability, so the goal here is to limit the amount of logs we process in order to leave capacity for all the other requests.


Subtasks 1 (0 open1 closed)

Task #20201: Review 20200-log-limitResolvedPeter Amstutz03/03/2023Actions

Related issues

Precedes Arvados - Idea #20258: "Live" log view should show users the most recent logs, wherever those areResolved03/06/202303/06/2023Actions
Actions #1

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
  • Subject changed from crunch-run better handling of live logging to controller limits log create requests
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Peter Amstutz
  • Status changed from New to In Progress
Actions #4

Updated by Peter Amstutz about 1 year ago

20200-log-limit @ b5874d1fae46eef00662d74fe25e6841bec68650

  • Add LogCreateRequestFraction config option
  • Add limitLogCreateRequests middleware to controller Handler
  • Use a go channel to keep a counter of in-flight log create requests, and return early with a 503 error when the channel is saturated
  • Add a test for this behavior

developer-run-tests: #3527

Actions #5

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #6

Updated by Tom Clegg about 1 year ago

Changing default from unlimited to MaxConcurrentRequests: 64 may deserve a mention in upgrade notes.

Instead of implementing "unlimited" with an arbitrary limit of 4096, we could create the channel with length 0, and in limitLogCreateRequests say

if cap(h.limitLogCreate) > 0 && ...

Releasing the slot (<-h.limitLogCreate) should be in a defer, so it runs even if next.ServeHTTP() panics. (Go http server recovers from panicking handlers.)

Nit: Test would probably be easier to read if we just create one http.Request and use it N times, or make a tryLogging := func(expectSuccess bool) and call that N+1 times

Rest LGTM, thanks

Actions #9

Updated by Peter Amstutz about 1 year ago

Running this one more time

20200-log-limit @ 51b11d12c688dd9fae37f527d4c1ddf1e1a186f6

developer-run-tests: #3538

The sdk/go/keepclient test is passing when I run it locally, but seeing this error on Jenkins:

15:27:59 FAIL: keepclient_test.go:1087: ServerRequiredSuite.TestPutGetHead
15:27:59 
15:27:59 keepclient_test.go:1099:
15:27:59     c.Check(err, Equals, BlockNotFound)
15:27:59 ... obtained *keepclient.ErrNotFound = &keepclient.ErrNotFound{multipleResponseError:keepclient.multipleResponseError{error:(*errors.errorString)(0xc0014aafd0), isTemp:false}} ("HEAD a8383a37e70cc4c384694c32c3face5f failed: [http://localhost:33893/a8383a37e70cc4c384694c32c3face5f: HTTP 404 \"\" http://localhost:36625/a8383a37e70cc4c384694c32c3face5f: HTTP 405 \"\"]")
15:27:59 ... expected *keepclient.ErrNotFound = &keepclient.ErrNotFound{multipleResponseError:keepclient.multipleResponseError{error:(*errors.errorString)(0xc00020c5d0), isTemp:false}} ("Block not found")
15:27:59 ... Difference:
15:27:59 ...     multipleResponseError.error.s: "HEAD a8383a37e70cc4c384694c32c3face5f failed: [http://localhost:33893/a8383a37e70cc4c384694c32c3face5f: HTTP 404 \"\" http://localhost:36625/a8383a37e70cc4c384694c32c3face5f: HTTP 405 \"\"]" != "Block not found" 

This is weird because 405 is HTTP method not allowed. I'm wondering if it is accidentally talking to the service from RunFakeKeepServer (which returns StatusMethodNotAllowed for HEAD) and not the real keepstore. However, the keep0 logs show a "HEAD a8383a37e70cc4c384694c32c3face5f" request returning 404, but that could be from something else.

Actions #10

Updated by Peter Amstutz about 1 year ago

It looks like it is contacting two different services, one on port 33893 which returns 404 as expected and one on 36625. The keep0 log has these errors:

{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000001.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:36625/","time":"2023-03-06T20:30:04.057455704Z"}

The keep test server sets up 4 keepstore servers at various ports, with a note about possible races and SO_REUSEADDR / SO_REUSEPORT.

Actions #11

Updated by Peter Amstutz about 1 year ago

Ah, actually, the last run (developer-run-tests-remainder: #3720 ) failed on a different test (services/keep-balance). So it is looking more likely that I'm just hitting several flaky tests in a row, not that this is a bug introduced by the branch.

Re-run again developer-run-tests-remainder: #3721

Actions #12

Updated by Peter Amstutz about 1 year ago

  • Release set to 57
Actions #13

Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Brett Smith about 1 year ago

  • Precedes Idea #20258: "Live" log view should show users the most recent logs, wherever those are added
Actions

Also available in: Atom PDF