Feature #8019

[Crunch2] Throttle logs

Added by Peter Amstutz almost 6 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
-
Target version:
Start date:
04/25/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.0

Description

Implement log throttling in crunch-run. Behavior should be equivalent to services/api/lib/crunch_dispatch.rb#rate_limit

Use same configuration parameters as currently in API server config:

  # These two settings control how frequently log events are flushed to the
  # database.  Log lines are buffered until either crunch_log_bytes_per_event
  # has been reached or crunch_log_seconds_between_events has elapsed since
  # the last flush.
  crunch_log_bytes_per_event: 4096
  crunch_log_seconds_between_events: 1

  # The sample period for throttling logs, in seconds.
  crunch_log_throttle_period: 60

  # Maximum number of bytes that job can log over crunch_log_throttle_period
  # before being silenced until the end of the period.
  crunch_log_throttle_bytes: 65536

  # Maximum number of lines that job can log over crunch_log_throttle_period
  # before being silenced until the end of the period.
  crunch_log_throttle_lines: 1024

  # Maximum bytes that may be logged by a single job.  Log bytes that are
  # silenced by throttling are not counted against this total.
  crunch_limit_log_bytes_per_job: 67108864

Above parameters should be published in the discovery document for use by crunch-run.


Subtasks

Task #11478: Review 8019-crunchrun-log-throttleResolvedPeter Amstutz

Associated revisions

Revision 3b4325c2
Added by Radhika Chippada over 4 years ago

closes #8019
Merge branch '8019-crunchrun-log-throttle'

History

#1 Updated by Tom Morris over 4 years ago

  • Description updated (diff)
  • Target version set to 2017-04-26 sprint
  • Story points set to 2.0
  • Release deleted (11)

#2 Updated by Radhika Chippada over 4 years ago

  • Assigned To set to Radhika Chippada

#3 Updated by Peter Amstutz over 4 years ago

  • Description updated (diff)

#4 Updated by Radhika Chippada over 4 years ago

  • Status changed from New to In Progress

#5 Updated by Radhika Chippada over 4 years ago

8019-crunchrun-log-throttle branch at 9dabca0eedbc9f842d542fea3463a441140d590c

#6 Updated by Peter Amstutz over 4 years ago

Reviewing 8019-crunchrun-log-throttle @ 9dabca0eedbc9f842d542fea3463a441140d590c

  • crunch-run needs to work with API servers that don't publish the crunchLog* settings. Provide defaults. It would also be better to copy the settings out of the discovery document just once into fields on ThrottledLogger/ArvLogWriter instead of looking them up from the discovery document every time.
  • The value used to initialize time.NewTicker should be crunchLogSecondsBetweenEvents
    logging.go:84:
    func (tl *ThrottledLogger) flusher() {
        ticker := time.NewTicker(tl.crunchLogSecondsBetweenEvents)
    
  • Rename the ThrottledLogger.stopping channel to ThrottledLogger.flush.
    logging.go:88:
            case _, open := <-tl.flush:
                // if closed, flush tl.buf, then exit the loop
                stopping = !open
    
  • ThrottledLogger.Write should check crunchLogBytesPerEvent and flush when exceeded.
    logging.go:78:
      if tl.buf >= tl.crunchLogBytesPerEvent {
        tl.flush <- struct{}
      }
    
  • This should reuse the variable 'now' that you've already stored previously.
    logging.go:220:
            arvlog.logThrottleResetTime = time.Now().Add(time.Second * time.Duration(int(crunchLogThrottlePeriod.(float64))))
    
  • There's no point in having ArvLogWriter.rateLimit return the value of logThrottleIsOpen (the first bool) if it is just discarded (L242)
  • Similarly, ArvLogWriter.rateLimit only return an error when checking arvlog.ArvClient.Discovery() but it shouldn't be doing that (1st comment) so it doesn't need to return err either.
  • Please rename stderrBufToFlush and stderrFlushedAt to bufToFlush and flushedAt because they are not stderr-specific.
  • This regular expression is wrong for crunch-run logs.
    var lineRegexp = regexp.MustCompile(`^\S+ \S+ \d+ \d+ stderr (.*)`)
    

    Should be
    var lineRegexp = regexp.MustCompile(`^\S+ (.*)`)
    
  • Consider passing in a single value for now to ArvLogWriter.rateLimit() instead of calling time.Now() many times

#7 Updated by Peter Amstutz over 4 years ago

To amend my previous comment, from crunch_dispatch.rb:

          # rate_limit returns true or false as to whether to actually log
          # the line or not.  It also modifies "line" in place to replace
          # it with an error if a logging limit is tripped.

logging.go is ignoring the return value of ArvLogWriter.rateLimit(), but it shouldn't.

#8 Updated by Radhika Chippada over 4 years ago

Peter: Addressed all your comments at c5c09df, except "ThrottledLogger.Write should check crunchLogBytesPerEvent and flush when exceeded"

When I try to add this to ThrottledLogger.Write, the tests are hanging (I ran all the tests in crunch-run and do not know which specific test(s) are hanging). Please advise.

#9 Updated by Radhika Chippada over 4 years ago

  • Target version changed from 2017-04-26 sprint to 2017-05-10 sprint
  • Story points changed from 2.0 to 0.0

#10 Updated by Peter Amstutz over 4 years ago

8019-crunchrun-log-throttle @ c5c09df38966595b4f27c402d1e9ae5500d6d201

  • Is there a particular reason to skip empty lines?
    logging.go:239:
            } else if len(line) == 0 {
                continue
            }
    
  • Can you update this comment to say "once per crunchLogSecondsBetweenEvents"
    logging.go:161:
    // (b) batches log messages and only calls the underlying Writer at most once
    // per second.
    
  • It already sets arvlog.logThrottleIsOpen inside rateLimit. The return value is supposed to be whether to log that particular line or not. So this should be changed from:
    logging.go:244:
    logOpen, msg := arvlog.rateLimit(line, now)
    arvlog.bufToFlush.WriteString(string(msg) + "\n")
    rvlog.logThrottleIsOpen = logOpen
    

    to
    logOpen, msg := arvlog.rateLimit(line, now)
    if logOpen {
         arvlog.bufToFlush.WriteString(string(msg) + "\n")
    }
    
  • The logic for "Rate-limiting partial segments of long lines" doesn't make sense. I think you want remove the "partialLine" and "skipCounts" flags and adjust the if statement (this works in conjunction with the previous point, where returning false suppresses just the current line instead of closing the throttle entirely):
    logging.go:295:
    if now.After(arvlog.logThrottlePartialLineLastAt.Add(time.Second * time.Duration(crunchLogPartialLineThrottlePeriod))) {
        arvlog.logThrottlePartialLineLastAt = now
    } else {
        return false, line
    }
    
  • You mentioned tests are deadlocking when you write to the flush channel from ThrottledLogger.Write. I think I know why. If that method gets called multiple times, it would send the flush signal multiple times. Since the channel size is 1, it blocks after the 1st send. So you need an additional flag:
    type ThrottledLogger struct {
    ...
      pendingFlush bool
    }
    

    func (tl *ThrottledLogger) Write(p []byte) (n int, err error) {
    ...
      if len(tl.buf) >= crunchLogBytesPerEvent && !tl.pendingFLush {
        tl.pendingFlush = true
        tl.flush <- struct{}
      } 
    }
    

    func (tl *ThrottledLogger) flusher() {
    ...
            tl.Mutex.Lock()
            ready, tl.buf = tl.buf, nil
                    tl.pendingFlush = false
            tl.Mutex.Unlock()
    }
    

#11 Updated by Radhika Chippada over 4 years ago

Is there a particular reason to skip empty lines?

This is how we are doing in the original code (crunch_dispatch.rb) as well. Removing this is resulting an extra empty line at the end of log file, which would need updating the tests. I am not sure if empty lines are desired. Left this as is.

Can you update this comment to say "once per crunchLogSecondsBetweenEvents"

Done

It already sets arvlog.logThrottleIsOpen inside rateLimit. The return value is supposed to be whether to log that particular line or not. So this should be changed from ...

Done

The logic for "Rate-limiting partial segments of long lines" doesn't make sense. I think you want remove the "partialLine" and "skipCounts" flags ...

Removed skipCounts, but need partialLine flag to ensure we only log the message once per "Rate-limiting partial segments of long lines to one every n seconds". Also fixed an issue in block at line 301 in this regards

You mentioned tests are deadlocking when you write to the flush channel from ThrottledLogger.Write. I think I know why. If that method gets called multiple times, it would send the flush signal multiple times. Since the channel size is 1, it blocks after the 1st send. So you need an additional flag "pendingFlush"

Even after adding this the tests are "taking too long". On further debugging, I narrowed it down to the test "TestWriteLogsLarge". It is basically flooding the log and causing many flusher events from tl.Write func. Commented out this logic for now. It does not appear that we want to do this for the case where we have log flooding situation. Do we instead want to set the ticker in flusher func back to 1 or a factor of crunchLogSecondsBetweenEvents, whichever is greater?

#12 Updated by Radhika Chippada over 4 years ago

At e54bce82

Two tests "TestCrunchStat" and "TestNodeInfoLog" were failing. This was due to rate limiting alone causing loss of log data (delaying writing to server until the bufToFlush or bufFlushedAt conditions were exceeded). The following update was needed.

-       if int64(arvlog.bufToFlush.Len()) > crunchLogBytesPerEvent ||
-               (now.Sub(arvlog.bufFlushedAt) >= crunchLogSecondsBetweenEvents) {
+       if (int64(arvlog.bufToFlush.Len()) > crunchLogBytesPerEvent ||
+               (now.Sub(arvlog.bufFlushedAt) >= crunchLogSecondsBetweenEvents) ||
+               arvlog.closing) && (arvlog.bufToFlush.Len() > 0) {

However, this required commenting out line 638 in crunchrun_test.rb . With this line in, TestWriteLogsLarge is failing (without this test, no issues and all other tests pass). The test suite panicking because there is no api.Calls in api.Content

c.Check(api.Content[api.Calls-1]["container"].(arvadosclient.Dict)["log"], NotNil)

Please take a look and comment as to how to proceed. Thanks

#13 Updated by Radhika Chippada over 4 years ago

Error log:

 ********** Running services/crunch-run tests **********

2017/05/03 13:46:14 [notice] 12463#12463: using the "epoll" event method
2017/05/03 13:46:14 [notice] 12463#12463: nginx/1.10.2
2017/05/03 13:46:14 [notice] 12463#12463: OS: Linux 4.8.0-49-generic
2017/05/03 13:46:14 [notice] 12463#12463: getrlimit(RLIMIT_NOFILE): 1024:65536
2017/05/03 13:46:14 [notice] 12463#12463: start worker processes
2017/05/03 13:46:14 [notice] 12463#12463: start worker process 12477
SIGQUIT: quit
PC=0x45cb21 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0xc1fd70, 0x0, 0x0, 0x0, 0x0, 0xbb68eef0, 0x0, 0x0, 0x7ffcbb68ef00, 0x40fbcb, ...)
    /usr/lib/go-1.8/src/runtime/sys_linux_amd64.s:422 +0x21
runtime.futexsleep(0xc1fd70, 0x0, 0xffffffffffffffff)
    /usr/lib/go-1.8/src/runtime/os_linux.go:45 +0x62
runtime.notesleep(0xc1fd70)
    /usr/lib/go-1.8/src/runtime/lock_futex.go:145 +0x6b
runtime.stopm()
    /usr/lib/go-1.8/src/runtime/proc.go:1650 +0xad
runtime.findrunnable(0xc42001e000, 0x0)
    /usr/lib/go-1.8/src/runtime/proc.go:2102 +0x2e4
runtime.schedule()
    /usr/lib/go-1.8/src/runtime/proc.go:2222 +0x14c
runtime.park_m(0xc420060340)
    /usr/lib/go-1.8/src/runtime/proc.go:2285 +0xab
runtime.mcall(0x7ffcbb68f090)
    /usr/lib/go-1.8/src/runtime/asm_amd64.s:269 +0x5b

goroutine 1 [chan receive, 10 minutes]:
testing.(*T).Run(0xc420067a00, 0x8365cb, 0xe, 0x84b1f0, 0xc4201afd20)
    /usr/lib/go-1.8/src/testing/testing.go:698 +0x2f4
testing.runTests.func1(0xc420067a00)
    /usr/lib/go-1.8/src/testing/testing.go:882 +0x67
testing.tRunner(0xc420067a00, 0xc4201afde0)
    /usr/lib/go-1.8/src/testing/testing.go:657 +0x96
testing.runTests(0xc420294540, 0xbea770, 0x1, 0x1, 0x20)
    /usr/lib/go-1.8/src/testing/testing.go:888 +0x2c1
testing.(*M).Run(0xc420501f20, 0xc4201aff20)
    /usr/lib/go-1.8/src/testing/testing.go:822 +0xfc
main.main()
    _/home/radhika/arvados/services/crunch-run/_test/_testmain.go:42 +0xf7

goroutine 17 [syscall, 10 minutes, locked to thread]:
runtime.goexit()
    /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197 +0x1

goroutine 4 [syscall, 10 minutes]:
os/signal.signal_recv(0x0)
    /usr/lib/go-1.8/src/runtime/sigqueue.go:116 +0x104
os/signal.loop()
    /usr/lib/go-1.8/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1
    /usr/lib/go-1.8/src/os/signal/signal_unix.go:28 +0x41

goroutine 5 [chan receive, 10 minutes]:
gopkg.in/check%2ev1.(*suiteRunner).runTest(0xc4201cdd80, 0xc420308620, 0x0)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:810 +0x61
gopkg.in/check%2ev1.(*suiteRunner).run(0xc4201cdd80, 0xc3b8a0)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:615 +0xf4
gopkg.in/check%2ev1.Run(0x7eb320, 0xc3b8a0, 0xc4202113c0, 0xc4200a83f0)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/run.go:92 +0x4d
gopkg.in/check%2ev1.RunAll(0xc4202113c0, 0x7ffcbb68ffb4)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/run.go:84 +0xed
gopkg.in/check%2ev1.TestingT(0xc420067ad0)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/run.go:72 +0x3f7
_/home/radhika/arvados/services/crunch-run.TestCrunchExec(0xc420067ad0)
    /home/radhika/arvados/services/crunch-run/crunchrun_test.go:38 +0x2b
testing.tRunner(0xc420067ad0, 0x84b1f0)
    /usr/lib/go-1.8/src/testing/testing.go:657 +0x96
created by testing.(*T).Run
    /usr/lib/go-1.8/src/testing/testing.go:697 +0x2ca

goroutine 6 [select, 10 minutes]:
gopkg.in/check%2ev1.(*resultTracker)._loopRoutine(0xc4200a8480)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:461 +0x416
created by gopkg.in/check%2ev1.(*resultTracker).start
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:441 +0x3f

goroutine 7 [chan receive, 9 minutes]:
_/home/radhika/arvados/services/crunch-run.(*ThrottledLogger).Close(0xc4200620a0, 0xc420503aa8, 0x1)
    /home/radhika/arvados/services/crunch-run/logging.go:128 +0x78
_/home/radhika/arvados/services/crunch-run.(*LoggingTestSuite).TestWriteLogsLarge(0xc3b8a0, 0xc4201da870)
    /home/radhika/arvados/services/crunch-run/logging_test.go:68 +0x31b
reflect.Value.call(0x7eb320, 0xc3b8a0, 0x613, 0x8326e0, 0x4, 0xc42003df70, 0x1, 0x1, 0xc3cf60, 0x82a960, ...)
    /usr/lib/go-1.8/src/reflect/value.go:434 +0x91f
reflect.Value.Call(0x7eb320, 0xc3b8a0, 0x613, 0xc42003df70, 0x1, 0x1, 0x0, 0xc42003df07, 0xc4200b4820)
    /usr/lib/go-1.8/src/reflect/value.go:302 +0xa4
gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc4201da870)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:772 +0x643
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1(0xc4201cdd80, 0xc4201da870, 0xc420295180)
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:666 +0x7c
created by gopkg.in/check%2ev1.(*suiteRunner).forkCall
    /tmp/tmp.B1XT4Ndn22/GOPATH/src/gopkg.in/check.v1/check.go:667 +0x230

goroutine 8 [chan receive, 9 minutes]:
_/home/radhika/arvados/services/crunch-run.(*CollectionFileWriter).goUpload(0xc420062050, 0xc420306ea0)
    /home/radhika/arvados/services/crunch-run/upload.go:97 +0xe6
created by _/home/radhika/arvados/services/crunch-run.(*CollectionWriter).Open
    /home/radhika/arvados/services/crunch-run/upload.go:170 +0x323

goroutine 9 [select]:
_/home/radhika/arvados/services/crunch-run.(*ThrottledLogger).flusher(0xc4200620a0)
    /home/radhika/arvados/services/crunch-run/logging.go:99 +0x28d
created by _/home/radhika/arvados/services/crunch-run.NewThrottledLogger
    /home/radhika/arvados/services/crunch-run/logging.go:181 +0x180

rax    0xca
rbx    0x0
rcx    0x45cb23
rdx    0x0
rdi    0xc1fd70
rsi    0x0
rbp    0x7ffcbb68eed0
rsp    0x7ffcbb68ee88
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0xc42001a2c0
r13    0xc42034a038
r14    0xc420306f00
r15    0x0
rip    0x45cb21
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
*** Test killed with quit: ran too long (10m0s).
FAIL    _/home/radhika/arvados/services/crunch-run    600.042s

#14 Updated by Radhika Chippada over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:3b4325c210516d1f61838fb26e06e0a11c31ce6d.

Also available in: Atom PDF