Bug #9068

[Keep] keepstore should drop connections immediately if the client hangs up while waiting for bufferPool

Added by Tom Clegg about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tom Clegg
Category:
Keep
Target version:
2016-05-11 sprint
Start date:
04/26/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Background

If a keepstore server is busy, a client can wait a long time in the queue before we start fulfilling its request. If the client times out and hangs up during this period, we waste a network socket (possibly reaching max open files) and (in the case of GET) we waste time retrieving the requested data from storage, because we don't notice that the client has given up.

Proposed fix

(from #8825)

Check CloseNotifier in GetBlockHandler

Move the buffer-pool allocation step up from volume to handler. Add a MightExist() method to each volume type: return false if the block definitely doesn't exist, otherwise true. (Without this, we'd have a performance regression: unix volumes would no longer be able to respond 404 without reserving a data buffer.) With this change, we can check CloseNotifier in GetBlockHandler:
  • if the client hangs up before we get a buffer, don't keep waiting for a buffer, or do any other work on the request. Log 503.
  • if the client hangs up after we get a buffer, but before we finish reading data from disk/blob storage, we'll still waste some time. This is OK.

Check CloseNotifier in PutBlockHandler

Detecting abandoned PUT requests is easier: PutBlockHandler() has the http.Request object at the point where it waits for bufs.Get().

Something like this:

-buf := bufs.Get(int(req.ContentLength))
+var closeNotifier <-chan bool
+if resp, ok := resp.(http.CloseNotifier); ok {
+    closeNotifier = resp.CloseNotify()
+}
+var buf []byte
+bufReady := make(chan []byte)
+go func() {
+    bufReady <- bufs.Get(int(req.ContentLength))
+    close(bufReady)
+}()
+select {
+case buf = <-bufReady:
+case <-closeNotifier:
+    go func() {
+        // Even if closeNotifier happened first, we need to keep waiting
+        // for our buf so we can return it to the pool.
+        bufs.Put(<-bufReady)
+    }()
+    // Meanwhile we exit our handler to free up connection resources.
+    resp.WriteHeader(http.StatusServiceUnavailable)
+    resp.Write([]byte("Client disconnected"))
+    return
+}
 _, err := io.ReadFull(req.Body, buf)

Subtasks

Task #9104: Drop abandoned GET requests (Resolved, Tom Clegg)

Task #9105: Drop abandoned PUT requests (Resolved, Tom Clegg)

Task #9099: Review 9068-drop-abandoned-conns (Resolved, Radhika Chippada)


Related issues

Related to Arvados - Bug #8825: [Keepstore] many connections in CLOSE_WAIT (Resolved, 03/07/2017)

Associated revisions

Revision c7259164
Added by Tom Clegg about 5 years ago

Merge branch '9068-drop-abandoned-conns'

closes #9068

History

#1 Updated by Tom Clegg about 5 years ago

  • Category set to Keep
  • Target version set to Arvados Future Sprints

#2 Updated by Tom Clegg about 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-05-11 sprint

#3 Updated by Tom Clegg about 5 years ago

  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg about 5 years ago

9068-drop-abandoned-conns @ 1b5e165

#5 Updated by Tom Clegg about 5 years ago

  • Status changed from New to In Progress

#6 Updated by Radhika Chippada about 5 years ago

Just a few review comments:

  • This comment is a bit confusing: “If the data in the backing store is bigger than len(buf), which will not exceed BlockSize,” Do you mean “if the len is bigger than BlockSize?”
  • In logging_router, we have “If upstream doesn’t implement …”: If I were to simulate this by returning nil in (w *LoggingResponseWriter) CloseNotify(), the tests hang. So, I am wondering if we need additional caution in such a scenario in handlers.getBufferForResponseWriter?
  • Needs “go fmt”, and there's one new golint error
  • run-tests.sh: thanks for fixing that annoying coverage related error message

#7 Updated by Tom Clegg about 5 years ago

Radhika Chippada wrote:

Just a few review comments:

  • This comment is a bit confusing: “If the data in the backing store is bigger than len(buf), which will not exceed BlockSize,” Do you mean “if the len is bigger than BlockSize?”

Indeed. I've split that into two distinct points:

    // If the data in the backing store is bigger than len(buf),
    // then Get is permitted to return an error without reading
    // any of the data.
    //
    // len(buf) will not exceed BlockSize.
  • In logging_router, we have “If upstream doesn’t implement …”: If I were to simulate this by returning nil in (w *LoggingResponseWriter) CloseNotify(), the tests hang. So, I am wondering if we need additional caution in such a scenario in handlers.getBufferForResponseWriter?

This just means the tests are working -- but it does seem silly to wait the full 10 minutes before timing out. Added a 20 second timeout and an error message, so now if you return nil from CloseNotify() you should see:

=== RUN   TestGetHandlerClientDisconnect   
2016/05/04 10:58:08 reached max buffers (1), waiting
--- FAIL: TestGetHandlerClientDisconnect (20.02s)
        handler_test.go:964: request took >20s, close notifier must be broken

(and all other tests should pass)

  • Needs “go fmt”, and there's one new golint error

Fixed, thanks.

Also fixed a race that leaked state between tests: the goroutine started by getBufferForResponseWriter would sometimes call bufs.Get() after the test case had finished, and end up taking a buffer from the wrong buffer pool. Evil globals!

Now at 2814672

#8 Updated by Radhika Chippada about 5 years ago

Thanks for the updates. LGTM

#9 Updated by Tom Clegg about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:c7259164cd07f09b5d63f39ddc850c50f249ded2.
