Bug #9068

[Keep] keepstore should drop connections immediately if the client hangs up while waiting for bufferPool

Added by Tom Clegg almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Keep
Target version: 2016-05-11 sprint
Story points: 1.0

Description

Background

If a keepstore server is busy, a client can wait a long time in the queue before we start fulfilling its request. If the client times out and hangs up during this wait, we waste a network socket (possibly hitting the max-open-files limit) and, in the case of GET, we also waste time retrieving the requested data from storage, because we never notice that the client has given up.

Proposed fix

(from #8825)

Check CloseNotifier in GetBlockHandler

Move the buffer-pool allocation step up from the volume layer to the handler. Add a MightExist() method to each volume type: it returns false if the block definitely doesn't exist, and true otherwise. (Without this, we'd have a performance regression: unix volumes would no longer be able to respond 404 without reserving a data buffer.) With this change, we can check CloseNotifier in GetBlockHandler:
  • if the client hangs up before we get a buffer, don't keep waiting for a buffer, or do any other work on the request. Log 503.
  • if the client hangs up after we get a buffer, but before we finish reading data from disk/blob storage, we'll still waste some time. This is OK.

Check CloseNotifier in PutBlockHandler

Detecting abandoned PUT requests is easier: PutBlockHandler() has the http.Request object at the point where it waits for bufs.Get().

Something like this:

-buf := bufs.Get(int(req.ContentLength))
+var closeNotifier <-chan bool
+if resp, ok := resp.(http.CloseNotifier); ok {
+    closeNotifier = resp.CloseNotify()
+}
+var buf []byte
+bufReady := make(chan []byte)
+go func() {
+    bufReady <- bufs.Get(int(req.ContentLength))
+    close(bufReady)
+}()
+select {
+case buf = <-bufReady:
+case <-closeNotifier:
+    go func() {
+        // Even if closeNotifier happened first, we need to keep waiting
+        // for our buf so we can return it to the pool.
+        bufs.Put(<-bufReady)
+    }()
+    // Meanwhile we exit our handler to free up connection resources.
+    resp.WriteHeader(http.StatusServiceUnavailable)
+    resp.Write([]byte("Client disconnected"))
+    return
+}
 _, err := io.ReadFull(req.Body, buf)

Subtasks (3 total: 0 open, 3 closed)

Task #9104: Drop abandoned GET requests (Resolved, Tom Clegg, 04/26/2016)
Task #9105: Drop abandoned PUT requests (Resolved, Tom Clegg, 04/26/2016)
Task #9099: Review 9068-drop-abandoned-conns (Resolved, Radhika Chippada, 04/26/2016)

Related issues

Related to Arvados - Bug #8825: [Keepstore] many connections in CLOSE_WAIT (Resolved, 03/07/2017)
#1

Updated by Tom Clegg almost 8 years ago

  • Category set to Keep
  • Target version set to Arvados Future Sprints
#2

Updated by Tom Clegg almost 8 years ago

  • Target version changed from Arvados Future Sprints to 2016-05-11 sprint
#3

Updated by Tom Clegg almost 8 years ago

  • Assigned To set to Tom Clegg
#4

Updated by Tom Clegg almost 8 years ago

9068-drop-abandoned-conns @ 1b5e165

#5

Updated by Tom Clegg almost 8 years ago

  • Status changed from New to In Progress
#6

Updated by Radhika Chippada almost 8 years ago

Just a few review comments:

  • This comment is a bit confusing: “If the data in the backing store is bigger than len(buf), which will not exceed BlockSize,” Do you mean “if the len is bigger than BlockSize?”
  • In logging_router, we have “If upstream doesn’t implement …”: If I were to simulate this by returning nil in (w *LoggingResponseWriter) CloseNotify(), the tests hang. So, I am wondering if we need additional caution in such a scenario in handlers.getBufferForResponseWriter?
  • Need “go fmt” and one new golint error
  • run-tests.sh: thanks for fixing that annoying coverage related error message
#7

Updated by Tom Clegg almost 8 years ago

Radhika Chippada wrote:

Just a few review comments:

  • This comment is a bit confusing: “If the data in the backing store is bigger than len(buf), which will not exceed BlockSize,” Do you mean “if the len is bigger than BlockSize?”

Indeed. I've split that into two distinct points:

    // If the data in the backing store is bigger than len(buf),
    // then Get is permitted to return an error without reading
    // any of the data.
    //
    // len(buf) will not exceed BlockSize.
  • In logging_router, we have “If upstream doesn’t implement …”: If I were to simulate this by returning nil in (w *LoggingResponseWriter) CloseNotify(), the tests hang. So, I am wondering if we need additional caution in such a scenario in handlers.getBufferForResponseWriter?

This just means the tests are working -- but it does seem silly to wait the full 10 minutes before timing out. Added a 20 second timeout and an error message, so now if you return nil from CloseNotify() you should see:

=== RUN   TestGetHandlerClientDisconnect   
2016/05/04 10:58:08 reached max buffers (1), waiting
--- FAIL: TestGetHandlerClientDisconnect (20.02s)
        handler_test.go:964: request took >20s, close notifier must be broken

(and all other tests should pass)

  • Need “go fmt” and one new golint error

Fixed, thanks.

Also fixed a race that leaked state between tests: the goroutine started by getBufferForResponseWriter would sometimes call bufs.Get() after the test case had finished, and end up taking a buffer from the wrong pool. Evil globals!

Now at 2814672

#8

Updated by Radhika Chippada almost 8 years ago

Thanks for the updates. LGTM

#9

Updated by Tom Clegg almost 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:c7259164cd07f09b5d63f39ddc850c50f249ded2.
