Bug #20200
closedcontroller limits log create requests
Description
Explicitly limit log create requests in controller so they do not overwhelm the API server.
When there are many containers running, they can overwhelm the API server sending live log updates. It is an acceptable tradeoff to discard logs in order to maintain system stability, so the goal here is to limit the amount of logs we process in order to leave capacity for all the other requests.
Updated by Peter Amstutz almost 2 years ago
- Description updated (diff)
- Subject changed from crunch-run better handling of live logging to controller limits log create requests
Updated by Peter Amstutz almost 2 years ago
- Assigned To set to Peter Amstutz
- Status changed from New to In Progress
Updated by Peter Amstutz almost 2 years ago
20200-log-limit @ b5874d1fae46eef00662d74fe25e6841bec68650
- Add LogCreateRequestFraction config option
- Add limitLogCreateRequests middleware to controller Handler
- Use a go channel to keep a counter of in-flight log create requests, and return early with a 503 error when the channel is saturated
- Add a test for this behavior
Updated by Tom Clegg almost 2 years ago
Changing default from unlimited to MaxConcurrentRequests: 64
may deserve a mention in upgrade notes.
Instead of implementing "unlimited" with an arbitrary limit of 4096, we could create the channel with length 0, and in limitLogCreateRequests say
if cap(h.limitLogCreate) > 0 && ...
Releasing the slot (<-h.limitLogCreate
) should be in a defer, so it runs even if next.ServeHTTP() panics. (Go http server recovers from panicking handlers.)
Nit: Test would probably be easier to read if we just create one http.Request and use it N times, or make a tryLogging := func(expectSuccess bool)
and call that N+1 times
Rest LGTM, thanks
Updated by Peter Amstutz almost 2 years ago
20200-log-limit @ 184d9fee6b72520be286b375ac569702246e224e
Updated by Peter Amstutz almost 2 years ago
20200-log-limit @ 51b11d12c688dd9fae37f527d4c1ddf1e1a186f6
Updated by Peter Amstutz almost 2 years ago
Running this one more time
20200-log-limit @ 51b11d12c688dd9fae37f527d4c1ddf1e1a186f6
The sdk/go/keepclient test is passing when I run it locally, but seeing this error on Jenkins:
15:27:59 FAIL: keepclient_test.go:1087: ServerRequiredSuite.TestPutGetHead 15:27:59 15:27:59 keepclient_test.go:1099: 15:27:59 c.Check(err, Equals, BlockNotFound) 15:27:59 ... obtained *keepclient.ErrNotFound = &keepclient.ErrNotFound{multipleResponseError:keepclient.multipleResponseError{error:(*errors.errorString)(0xc0014aafd0), isTemp:false}} ("HEAD a8383a37e70cc4c384694c32c3face5f failed: [http://localhost:33893/a8383a37e70cc4c384694c32c3face5f: HTTP 404 \"\" http://localhost:36625/a8383a37e70cc4c384694c32c3face5f: HTTP 405 \"\"]") 15:27:59 ... expected *keepclient.ErrNotFound = &keepclient.ErrNotFound{multipleResponseError:keepclient.multipleResponseError{error:(*errors.errorString)(0xc00020c5d0), isTemp:false}} ("Block not found") 15:27:59 ... Difference: 15:27:59 ... multipleResponseError.error.s: "HEAD a8383a37e70cc4c384694c32c3face5f failed: [http://localhost:33893/a8383a37e70cc4c384694c32c3face5f: HTTP 404 \"\" http://localhost:36625/a8383a37e70cc4c384694c32c3face5f: HTTP 405 \"\"]" != "Block not found"
This is weird because 405 is HTTP method not allowed. I'm wondering if it is accidentally talking to the service from RunFakeKeepServer (which returns StatusMethodNotAllowed for HEAD) and not the real keepstore. However, the keep0 logs show a "HEAD a8383a37e70cc4c384694c32c3face5f" request returning 404, but that could be from something else.
Updated by Peter Amstutz almost 2 years ago
It looks like it is contacting two different services, one on port 33893 which returns 404 as expected and one on 36625. The keep0 log has these errors:
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000001.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:36625/","time":"2023-03-06T20:30:04.057455704Z"}
The keep test server sets up 4 keepstore servers at various ports, with a note about possible races and SO_REUSEADDR / SO_REUSEPORT.
Updated by Peter Amstutz almost 2 years ago
Ah, actually, the last run (developer-run-tests-remainder: #3720 ) failed on a different test (services/keep-balance). So it is looking more likely that I'm just hitting several flaky tests in a row, not that this is a bug introduced by the branch.
Re-run again developer-run-tests-remainder: #3721
Updated by Peter Amstutz almost 2 years ago
- Status changed from In Progress to Resolved
Updated by Brett Smith almost 2 years ago
- Precedes Idea #20258: "Live" log view should show users the most recent logs, wherever those are added