Feature #7200

[Keep] keepproxy supports "index" API

Added by Tom Clegg over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
Keep
Target version:
Start date:
09/28/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

The response should follow the same API as keepstore's own index handler. In keepproxy terms this means:
  • Unless a complete index is retrieved for every service (i.e., http 200 and terminated by a single blank line), the index is incomplete, and keepproxy's response must not end with a blank line.
  • If a block appears 2x in each of N keepstore "index" responses, it appears 2N times in keepproxy's response, too.
  • If the index is complete, it must end with a single blank line.
  • The index is given in whatever order is convenient for keepproxy (the caller does not expect it to be sorted).

Implementation

New public function in sdk/go/keepclient:

// GetIndex retrieves a list of blocks stored on the given server whose hashes
// begin with the given prefix. The returned reader will return an error (other
// than EOF) if the complete index cannot be retrieved. This should only be
// expected to return useful results if the client is using a "data manager token" 
// recognized by the Keep services.
func (kc *KeepClient) GetIndex(keepServiceUUID, prefix string) (io.Reader, error) {
...
}

Tests for this function must include "caller sees error if partial index is retrieved from keepstore" (i.e., no blank line before EOF).

Some of the existing code in Data Manager may be useful here. (Data Manager might also deserve to be updated to use this new SDK feature instead of having duplicate code, but this might wait for a separate branch/story.)


Subtasks

Task #7390: Review branch 7200-keepproxy-index-apiResolvedTom Clegg


Related issues

Blocks Arvados - Bug #7167: [Deployment] Write an efficient Keep migration scriptResolved09/30/2015

Associated revisions

Revision 6c31bea9
Added by Radhika Chippada over 5 years ago

closes #7200
Merge branch '7200-keepproxy-index-api'

History

#1 Updated by Ward Vandewege over 5 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-30 sprint

#2 Updated by Brett Smith over 5 years ago

  • Target version changed from 2015-09-30 sprint to Arvados Future Sprints

#3 Updated by Radhika Chippada over 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Radhika Chippada

#4 Updated by Tom Clegg over 5 years ago

  • Description updated (diff)

#5 Updated by Radhika Chippada over 5 years ago

  • Added GetIndex method support in keepclient and added several tests.
  • Added IndexHandler in keepproxy. Used the keepclient method to implement this handler. Added tests.
  • Added a couple more tests to keepstore/handler_test.go to verify response with valid but no such prefix and invalid prefix format.
  • Pending question: In keepclient how should this comment be addressed: "This should only be expected to return useful results if the client is using a data manager token recognized by the Keep services" ?

#6 Updated by Radhika Chippada over 5 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-30 sprint

#7 Updated by Tom Clegg over 5 years ago

At eaf51fc

ErrorNoSuchKeepServer and ErrorIncompleteIndex should be ErrNoSuchKeepServer and ErrIncompleteIndex to match Go conventions (we already have some other offenders, but we should at least stick to Go idioms for new stuff).

ErrorGetIndex seems to mean "no information except that the real error was non-nil and was encountered in GetIndex". Seems like we should just return err itself in these three cases.

We shouldn't print debug logs from SDK methods. Just return an error, and let the application decide how to handle it.

Keeping the whole index in RAM isn't ideal but I think we can live with it for now, and refactor it later as a decorator that passes data from the underlying reader but still notices/interferes if the termination isn't correct. Meanwhile I think we should at least avoid making string copies of the byte slice:
  • -if (!strings.HasSuffix(string(respBody), "\n\n")) && (string(respBody) != "\n")
    +if !bytes.Equal(respBody, []byte("\n")) && !bytes.HasSuffix(respBody, []byte("\n\n"))
  • -return strings.NewReader(string(respBody)), nil
    +return bytes.NewReader(respBody), nil

Since the only purpose of the trailing newline is to detect an incomplete read even though HTTP can't signal errors encountered after the body starts sending (and Go readers don't have this limitation) I think GetIndex should strip the trailing newline from the data returned to the caller. That way the caller can use something like https://golang.org/pkg/bufio/#Reader.ReadString to read the data, without any special handling for the trailing newline.

It doesn't look like it will make any difference, but for the sake of realism the first two StubIndexHandlers' data (the "complete" ones) should have size and timestamp, perhaps like:
  • -        []byte(string(hash) + "\n\n")
    +        []byte(hash + "+3 1443559274\n\n")
  • (that string(hash) looks superfluous -- hash is already a string, right?)

In IndexHandler, unroll the error handling:

-       switch req.Method {
-       case "GET":
-               (normal flow)
-       default:
-               (handle error)
-               return
-       }
+       if req.Method != "GET" {
+               (handle error)
+               return
+       }
+       (normal flow)

There are two if err != nil { break } which seem like they should say status = http.StatusBadGateway; return instead -- that way we wouldn't need the switch err at the end of the func.

Rather than buffering all of the responses in RAM, we could just do resp.Write(readBytes) as soon as we have a complete index. (Aside from using less RAM, this might let us notice sooner and stop doing work if the client hangs up before we've fetched all indexes.)

Typo "implemenation" in keepproxy.go comment

The "GetIndex with prefix" test in keepproxy_test.go doesn't seem to check whether the results are actually restricted to the given prefix. It looks like we'll need to write another block in order to get different results from "GetIndex with no prefix" and "GetIndex with prefix". Granted, the test right after it ("valid but no such prefix") is fairly convincing, but still...

#8 Updated by Tom Clegg over 5 years ago

Radhika Chippada wrote:

  • Pending question: In keepclient how should this comment be addressed: "This should only be expected to return useful results if the client is using a data manager token recognized by the Keep services" ?

This doesn't require any new code, it was just meant as a hint to the SDK user that the API is only intended for admin/system use: "normal" users/tokens won't be able to get anything useful out of GetIndex. For example, if someone is hunting through the godoc pages for an API like "get a list of all blocks I can read" (an API that doesn't exist at all right now) it would be nice to avoid wasting their time trying to do it with GetIndex.

Suggestions for improving wording?

#9 Updated by Radhika Chippada over 5 years ago

ErrorNoSuchKeepServer and ErrorIncompleteIndex should be ErrNoSuchKeepServer ...

Done

ErrorGetIndex seems to mean "no information ...

Done

We shouldn't print debug logs from SDK methods. Just return an error, and let the application decide how to handle it.

Done

Keeping the whole index in RAM isn't ideal but I think we can live with it for now, and refactor it later as a decorator that passes data from the underlying reader but still notices/interferes if the termination isn't correct. Meanwhile I think we should at least avoid making string copies of the byte slice:

Done

Since the only purpose of the trailing newline is to detect an incomplete read even though HTTP can't signal errors encountered after the body starts sending (and Go readers don't have this limitation) I think GetIndex should strip the trailing newline from the data returned to the caller. That way the caller can use something like https://golang.org/pkg/bufio/#Reader.ReadString to read the data, without any special handling for the trailing newline.

Done. However, this now needed me add an extra new line in keepproxy. Long story short: LocalRoots includes keepservers as well as keepproxy urls. Hence, when keepclient tries to GetIndex from the proxy, it needs to terminate with "\n\n" and hence stripping it in keepclient (alas) makes keepproxy response to be "incomplete". What fun.

It doesn't look like it will make any difference, but for the sake of realism the first two StubIndexHandlers' data (the "complete" ones) should have size and timestamp, perhaps like:

Done

(that string(hash) looks superfluous -- hash is already a string, right?)

Done. Huh.

In IndexHandler, unroll the error handling:

Done. So much better now.

There are two if err != nil { break } which seem like they should say status = http.StatusBadGateway; return instead -- that way we wouldn't need the switch err at the end of the func.

Done

Rather than buffering all of the responses in RAM, we could just do resp.Write(readBytes) as soon as we have a complete index. (Aside from using less RAM, this might let us notice sooner and stop doing work if the client hangs up before we've fetched all indexes.)

Done

Typo "implemenation" in keepproxy.go comment

Done

The "GetIndex with prefix" test in keepproxy_test.go doesn't seem to check whether the results are actually restricted to the given prefix. It looks like we'll need to write another block in order to get different results from "GetIndex with no prefix" and "GetIndex with prefix". Granted, the test right after it ("valid but no such prefix") is fairly convincing, but still...

Done. Improved the tests to also check for "otherLocators" when no prefix is used and only expected locators are returned when a prefix is specified.

This doesn't require any new code, it was just meant as a hint to the SDK user that the API is only intended for admin/system use: "normal" users/tokens won't be able to get anything useful out of GetIndex. For example, if someone is hunting through the godoc pages for an API like "get a list of all blocks I can read" (an API that doesn't exist at all right now) it would be nice to avoid wasting their time trying to do it with GetIndex. Suggestions for improving wording?

It makes sense now. But, I think instead of "This should only be expected to return useful results ..." => "This will only return useful results ... " might make it easier to understand. wdyt?

#10 Updated by Tom Clegg over 5 years ago

Noticed another thing: I think we can remove all of the stuff about contentLen in (IndexHandler)ServeHTTP(). By the time we call resp.Header().Set(), the header has already been sent to the client, so it just goes nowhere. (Also, the trailing-newline check makes it unnecessary for the client to compare the Content-Length header with the actual number of bytes received.)

In GetIndex, resp.Body != nil is always true when we check it, according to docs -- "When err is nil, resp always contains a non-nil resp.Body." ...so we don't need that condition.

(re IRC question) yes, I think we need a defer resp.Body.Close() right before ioutil.ReadAll(resp.Body)

It makes sense now. But, I think instead of "This should only be expected to return useful results ..." => "This will only return useful results ... " might make it easier to understand. wdyt?

That sounds better. Or how about: "This is meant to be used only by system components and admin tools. It will return an error unless [...]"? Might help to make it a separate paragraph too.

#11 Updated by Radhika Chippada over 5 years ago

Addressed all the comments from note #10 as well. Thanks.

#12 Updated by Tom Clegg over 5 years ago

The "GetIndex with prefix" test looks better now except
  • c.Assert(totalLocators, Equals, matchingLocators) -- shouldn't totalLocators be greater than matchingLocators in this case?
  • It would be nice to use the same logic in both "GetIndex, then count results" loops. Currently one uses matchingLocators/otherLocators, the other uses totalLocators/matchingLocators. Perhaps something like
  • for _, spec := range []struct{
      prefix         string
      expectTestHash bool
      expectOther    bool
    }{
      {"", true, true},
      {"abcdef", false, false},
      {hash[:3], true, false},
    } {
      indexReader, err := kc.GetIndex("proxy", spec.prefix)
      ...
      gotTestHash := 0
      gotOther := 0
      for _, locator := range locators {
        c.Check(locator[:len(spec.prefix)], Equals, spec.prefix)
        if locator[:32] == hash {
          gotTestHash++
        } else {
          gotOther++
        }
      }
      c.Check(gotTestHash == 2, Equals, spec.expectTestHash)
      c.Check(gotOther > 0, Equals, spec.expectOther)
    }
    

#13 Updated by Tom Clegg over 5 years ago

Ooh. Assuming this is as easy as it looks, how about:

-               var readBytes []byte
-               readBytes, err = ioutil.ReadAll(reader)
-               if err != nil {
-                       status = http.StatusBadGateway
-                       return
-               }
-
-               // Got index for this server; write to resp
-               _, err := resp.Write(readBytes)
+               _, err := io.Copy(resp, reader)

#14 Updated by Tom Clegg over 5 years ago

Oops, I was wrong about the position of defer resp.Body.Close() in GetIndex in keepclient/keepclient.go. It should be right after we find out err != nil. (If we get err==nil and resp.StatusCode!=OK we still have to close resp.Body).

Rest LGTM, thanks!

#15 Updated by Radhika Chippada over 5 years ago

  • The test update with code reuse is much better. Used this with minor updates and additional comments.
  • Yes, io.Copy worked.
  • Updated the placement of resp.Body.Close()
  • Also, did a few minor updates around variable names and comments.

Thanks.

#16 Updated by Radhika Chippada over 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:6c31bea91f136bb300376847ced1fd965e037dd3.

Also available in: Atom PDF