Feature #8019
closed[Crunch2] Throttle logs
Description
Implement log throttling in crunch-run. Behavior should be equivalent to services/api/lib/crunch_dispatch.rb#rate_limit
Use same configuration parameters as currently in API server config:
# These two settings control how frequently log events are flushed to the # database. Log lines are buffered until either crunch_log_bytes_per_event # has been reached or crunch_log_seconds_between_events has elapsed since # the last flush. crunch_log_bytes_per_event: 4096 crunch_log_seconds_between_events: 1 # The sample period for throttling logs, in seconds. crunch_log_throttle_period: 60 # Maximum number of bytes that job can log over crunch_log_throttle_period # before being silenced until the end of the period. crunch_log_throttle_bytes: 65536 # Maximum number of lines that job can log over crunch_log_throttle_period # before being silenced until the end of the period. crunch_log_throttle_lines: 1024 # Maximum bytes that may be logged by a single job. Log bytes that are # silenced by throttling are not counted against this total. crunch_limit_log_bytes_per_job: 67108864
Above parameters should be published in the discovery document for use by crunch-run.
Updated by Tom Morris over 7 years ago
- Description updated (diff)
- Target version set to 2017-04-26 sprint
- Story points set to 2.0
- Release deleted (
11)
Updated by Radhika Chippada over 7 years ago
- Assigned To set to Radhika Chippada
Updated by Radhika Chippada over 7 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada over 7 years ago
8019-crunchrun-log-throttle branch at 9dabca0eedbc9f842d542fea3463a441140d590c
Updated by Peter Amstutz over 7 years ago
Reviewing 8019-crunchrun-log-throttle @ 9dabca0eedbc9f842d542fea3463a441140d590c
- crunch-run needs to work with API servers that don't publish the
crunchLog*
settings. Provide defaults. It would also be better to copy the settings out of the discovery document just once into fields on ThrottledLogger/ArvLogWriter instead of looking them up from the discovery document every time.
- The value used to initialize
time.NewTicker
should becrunchLogSecondsBetweenEvents
logging.go:84:func (tl *ThrottledLogger) flusher() { ticker := time.NewTicker(tl.crunchLogSecondsBetweenEvents)
- Rename the
ThrottledLogger.stopping
channel toThrottledLogger.flush
.
logging.go:88:case _, open := <-tl.flush: // if closed, flush tl.buf, then exit the loop stopping = !open
ThrottledLogger.Write
should checkcrunchLogBytesPerEvent
and flush when exceeded.
logging.go:78:if tl.buf >= tl.crunchLogBytesPerEvent { tl.flush <- struct{} }
- This should reuse the variable 'now' that you've already stored previously.
logging.go:220:arvlog.logThrottleResetTime = time.Now().Add(time.Second * time.Duration(int(crunchLogThrottlePeriod.(float64))))
- There's no point in having
ArvLogWriter.rateLimit
return the value oflogThrottleIsOpen
(the first bool) if it is just discarded (L242)
- Similarly,
ArvLogWriter.rateLimit
only return an error when checkingarvlog.ArvClient.Discovery()
but it shouldn't be doing that (1st comment) so it doesn't need to returnerr
either.
- Please rename
stderrBufToFlush
andstderrFlushedAt
tobufToFlush
andflushedAt
because they are not stderr-specific.
- This regular expression is wrong for crunch-run logs.
var lineRegexp = regexp.MustCompile(`^\S+ \S+ \d+ \d+ stderr (.*)`)
Should bevar lineRegexp = regexp.MustCompile(`^\S+ (.*)`)
- Consider passing in a single value for
now
toArvLogWriter.rateLimit()
instead of callingtime.Now()
many times
Updated by Peter Amstutz over 7 years ago
To amend my previous comment, from crunch_dispatch.rb:
# rate_limit returns true or false as to whether to actually log # the line or not. It also modifies "line" in place to replace # it with an error if a logging limit is tripped.
logging.go is ignoring the return value of ArvLogWriter.rateLimit()
, but it shouldn't.
Updated by Radhika Chippada over 7 years ago
Peter: Addressed all your comments at c5c09df, except "ThrottledLogger.Write should check crunchLogBytesPerEvent and flush when exceeded"
When I try to add this to ThrottledLogger.Write, the tests are hanging (I ran all the tests in crunch-run and do not know which specific test(s) are hanging). Please advise.
Updated by Radhika Chippada over 7 years ago
- Target version changed from 2017-04-26 sprint to 2017-05-10 sprint
- Story points changed from 2.0 to 0.0
Updated by Peter Amstutz over 7 years ago
8019-crunchrun-log-throttle @ c5c09df38966595b4f27c402d1e9ae5500d6d201
- Is there a particular reason to skip empty lines?
logging.go:239:} else if len(line) == 0 { continue }
- Can you update this comment to say "once per crunchLogSecondsBetweenEvents"
logging.go:161:// (b) batches log messages and only calls the underlying Writer at most once // per second.
- It already sets
arvlog.logThrottleIsOpen
inside rateLimit. The return value is supposed to be whether to log that particular line or not. So this should be changed from:
logging.go:244:logOpen, msg := arvlog.rateLimit(line, now) arvlog.bufToFlush.WriteString(string(msg) + "\n") rvlog.logThrottleIsOpen = logOpen
tologOpen, msg := arvlog.rateLimit(line, now) if logOpen { arvlog.bufToFlush.WriteString(string(msg) + "\n") }
- The logic for "Rate-limiting partial segments of long lines" doesn't make sense. I think you want remove the "partialLine" and "skipCounts" flags and adjust the if statement (this works in conjunction with the previous point, where returning false suppresses just the current line instead of closing the throttle entirely):
logging.go:295:if now.After(arvlog.logThrottlePartialLineLastAt.Add(time.Second * time.Duration(crunchLogPartialLineThrottlePeriod))) { arvlog.logThrottlePartialLineLastAt = now } else { return false, line }
- You mentioned tests are deadlocking when you write to the flush channel from ThrottledLogger.Write. I think I know why. If that method gets called multiple times, it would send the flush signal multiple times. Since the channel size is 1, it blocks after the 1st send. So you need an additional flag:
type ThrottledLogger struct { ... pendingFlush bool }
func (tl *ThrottledLogger) Write(p []byte) (n int, err error) { ... if len(tl.buf) >= crunchLogBytesPerEvent && !tl.pendingFLush { tl.pendingFlush = true tl.flush <- struct{} } }
func (tl *ThrottledLogger) flusher() { ... tl.Mutex.Lock() ready, tl.buf = tl.buf, nil tl.pendingFlush = false tl.Mutex.Unlock() }
Updated by Radhika Chippada over 7 years ago
Is there a particular reason to skip empty lines?
This is how we are doing in the original code (crunch_dispatch.rb) as well. Removing this is resulting an extra empty line at the end of log file, which would need updating the tests. I am not sure if empty lines are desired. Left this as is.
Can you update this comment to say "once per crunchLogSecondsBetweenEvents"
Done
It already sets arvlog.logThrottleIsOpen inside rateLimit. The return value is supposed to be whether to log that particular line or not. So this should be changed from ...
Done
The logic for "Rate-limiting partial segments of long lines" doesn't make sense. I think you want remove the "partialLine" and "skipCounts" flags ...
Removed skipCounts, but need partialLine flag to ensure we only log the message once per "Rate-limiting partial segments of long lines to one every n seconds". Also fixed an issue in block at line 301 in this regards
You mentioned tests are deadlocking when you write to the flush channel from ThrottledLogger.Write. I think I know why. If that method gets called multiple times, it would send the flush signal multiple times. Since the channel size is 1, it blocks after the 1st send. So you need an additional flag "pendingFlush"
Even after adding this the tests are "taking too long". On further debugging, I narrowed it down to the test "TestWriteLogsLarge". It is basically flooding the log and causing many flusher events from tl.Write func. Commented out this logic for now. It does not appear that we want to do this for the case where we have log flooding situation. Do we instead want to set the ticker in flusher func back to 1 or a factor of crunchLogSecondsBetweenEvents, whichever is greater?
Updated by Radhika Chippada over 7 years ago
At e54bce82
Two tests "TestCrunchStat" and "TestNodeInfoLog" were failing. This was due to rate limiting alone causing loss of log data (delaying writing to server until the bufToFlush or bufFlushedAt conditions were exceeded). The following update was needed.
- if int64(arvlog.bufToFlush.Len()) > crunchLogBytesPerEvent || - (now.Sub(arvlog.bufFlushedAt) >= crunchLogSecondsBetweenEvents) { + if (int64(arvlog.bufToFlush.Len()) > crunchLogBytesPerEvent || + (now.Sub(arvlog.bufFlushedAt) >= crunchLogSecondsBetweenEvents) || + arvlog.closing) && (arvlog.bufToFlush.Len() > 0) {
However, this required commenting out line 638 in crunchrun_test.rb . With this line in, TestWriteLogsLarge is failing (without this test, no issues and all other tests pass). The test suite panicking because there is no api.Calls in api.Content
c.Check(api.Content[api.Calls-1]["container"].(arvadosclient.Dict)["log"], NotNil)
Please take a look and comment as to how to proceed. Thanks
Updated by Radhika Chippada over 7 years ago
Error log:
********** Running services/crunch-run tests ********** 2017/05/03 13:46:14 [notice] 12463#12463: using the "epoll" event method 2017/05/03 13:46:14 [notice] 12463#12463: nginx/1.10.2 2017/05/03 13:46:14 [notice] 12463#12463: OS: Linux 4.8.0-49-generic 2017/05/03 13:46:14 [notice] 12463#12463: getrlimit(RLIMIT_NOFILE): 1024:65536 2017/05/03 13:46:14 [notice] 12463#12463: start worker processes 2017/05/03 13:46:14 [notice] 12463#12463: start worker process 12477 SIGQUIT: quit PC=0x45cb21 m=0 sigcode=0 goroutine 0 [idle]: runtime.futex(0xc1fd70, 0x0, 0x0, 0x0, 0x0, 0xbb68eef0, 0x0, 0x0, 0x7ffcbb68ef00, 0x40fbcb, ...) /usr/lib/go-1.8/src/runtime/sys_linux_amd64.s:422 +0x21 runtime.futexsleep(0xc1fd70, 0x0, 0xffffffffffffffff) /usr/lib/go-1.8/src/runtime/os_linux.go:45 +0x62 runtime.notesleep(0xc1fd70) /usr/lib/go-1.8/src/runtime/lock_futex.go:145 +0x6b runtime.stopm() /usr/lib/go-1.8/src/runtime/proc.go:1650 +0xad runtime.findrunnable(0xc42001e000, 0x0) /usr/lib/go-1.8/src/runtime/proc.go:2102 +0x2e4 runtime.schedule() /usr/lib/go-1.8/src/runtime/proc.go:2222 +0x14c runtime.park_m(0xc420060340) /usr/lib/go-1.8/src/runtime/proc.go:2285 +0xab runtime.mcall(0x7ffcbb68f090) /usr/lib/go-1.8/src/runtime/asm_amd64.s:269 +0x5b goroutine 1 [chan receive, 10 minutes]: testing.(*T).Run(0xc420067a00, 0x8365cb, 0xe, 0x84b1f0, 0xc4201afd20) /usr/lib/go-1.8/src/testing/testing.go:698 +0x2f4 testing.runTests.func1(0xc420067a00) /usr/lib/go-1.8/src/testing/testing.go:882 +0x67 testing.tRunner(0xc420067a00, 0xc4201afde0) /usr/lib/go-1.8/src/testing/testing.go:657 +0x96 testing.runTests(0xc420294540, 0xbea770, 0x1, 0x1, 0x20) /usr/lib/go-1.8/src/testing/testing.go:888 +0x2c1 testing.(*M).Run(0xc420501f20, 0xc4201aff20) /usr/lib/go-1.8/src/testing/testing.go:822 +0xfc main.main() _/home/radhika/arvados/services/crunch-run/_test/_testmain.go:42 +0xf7 goroutine 17 [syscall, 10 minutes, locked to thread]: runtime.goexit() /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197 +0x1 goroutine 4 [syscall, 10 minutes]: os/signal.signal_recv(0x0) /usr/lib/go-1.8/src/runtime/sigqueue.go:116 +0x104 os/signal.loop() /usr/lib/go-1.8/src/os/signal/signal_unix.go:22 +0x22 created by os/signal.init.1 /usr/lib/go-1.8/src/os/signal/signal_unix.go:28 +0x41 goroutine 5 [chan receive, 10 minutes]: gopkg.in/check%2ev1.(*suiteRunner).runTest(0xc4201cdd80, 0xc420308620, 0x0) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:810 +0x61 gopkg.in/check%2ev1.(*suiteRunner).run(0xc4201cdd80, 0xc3b8a0) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:615 +0xf4 gopkg.in/check%2ev1.Run(0x7eb320, 0xc3b8a0, 0xc4202113c0, 0xc4200a83f0) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/run.go:92 +0x4d gopkg.in/check%2ev1.RunAll(0xc4202113c0, 0x7ffcbb68ffb4) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/run.go:84 +0xed gopkg.in/check%2ev1.TestingT(0xc420067ad0) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/run.go:72 +0x3f7 _/home/radhika/arvados/services/crunch-run.TestCrunchExec(0xc420067ad0) /home/radhika/arvados/services/crunch-run/crunchrun_test.go:38 +0x2b testing.tRunner(0xc420067ad0, 0x84b1f0) /usr/lib/go-1.8/src/testing/testing.go:657 +0x96 created by testing.(*T).Run /usr/lib/go-1.8/src/testing/testing.go:697 +0x2ca goroutine 6 [select, 10 minutes]: gopkg.in/check%2ev1.(*resultTracker)._loopRoutine(0xc4200a8480) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:461 +0x416 created by gopkg.in/check%2ev1.(*resultTracker).start /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:441 +0x3f goroutine 7 [chan receive, 9 minutes]: _/home/radhika/arvados/services/crunch-run.(*ThrottledLogger).Close(0xc4200620a0, 0xc420503aa8, 0x1) /home/radhika/arvados/services/crunch-run/logging.go:128 +0x78 _/home/radhika/arvados/services/crunch-run.(*LoggingTestSuite).TestWriteLogsLarge(0xc3b8a0, 0xc4201da870) /home/radhika/arvados/services/crunch-run/logging_test.go:68 +0x31b reflect.Value.call(0x7eb320, 0xc3b8a0, 0x613, 0x8326e0, 0x4, 0xc42003df70, 0x1, 0x1, 0xc3cf60, 0x82a960, ...) /usr/lib/go-1.8/src/reflect/value.go:434 +0x91f reflect.Value.Call(0x7eb320, 0xc3b8a0, 0x613, 0xc42003df70, 0x1, 0x1, 0x0, 0xc42003df07, 0xc4200b4820) /usr/lib/go-1.8/src/reflect/value.go:302 +0xa4 gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc4201da870) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:772 +0x643 gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1(0xc4201cdd80, 0xc4201da870, 0xc420295180) /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:666 +0x7c created by gopkg.in/check%2ev1.(*suiteRunner).forkCall /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:667 +0x230 goroutine 8 [chan receive, 9 minutes]: _/home/radhika/arvados/services/crunch-run.(*CollectionFileWriter).goUpload(0xc420062050, 0xc420306ea0) /home/radhika/arvados/services/crunch-run/upload.go:97 +0xe6 created by _/home/radhika/arvados/services/crunch-run.(*CollectionWriter).Open /home/radhika/arvados/services/crunch-run/upload.go:170 +0x323 goroutine 9 [select]: _/home/radhika/arvados/services/crunch-run.(*ThrottledLogger).flusher(0xc4200620a0) /home/radhika/arvados/services/crunch-run/logging.go:99 +0x28d created by _/home/radhika/arvados/services/crunch-run.NewThrottledLogger /home/radhika/arvados/services/crunch-run/logging.go:181 +0x180 rax 0xca rbx 0x0 rcx 0x45cb23 rdx 0x0 rdi 0xc1fd70 rsi 0x0 rbp 0x7ffcbb68eed0 rsp 0x7ffcbb68ee88 r8 0x0 r9 0x0 r10 0x0 r11 0x286 r12 0xc42001a2c0 r13 0xc42034a038 r14 0xc420306f00 r15 0x0 rip 0x45cb21 rflags 0x286 cs 0x33 fs 0x0 gs 0x0 *** Test killed with quit: ran too long (10m0s). FAIL _/home/radhika/arvados/services/crunch-run 600.042s
Updated by Radhika Chippada over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:3b4325c210516d1f61838fb26e06e0a11c31ce6d.