Project

General

Profile

Actions

Feature #5414

closed

[Keep] Use data stored remotely (not in local keepstore)

Added by Tom Clegg about 9 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
-
Story points:
3.0

Subtasks 3 (1 open2 closed)

Task #5454: Prototype keep-to-s3 gatewayNewTom CleggActions
Task #5603: Review 5414-keep-service-hintsResolvedRadhika Chippada03/29/2015Actions
Task #5458: Update keep clients to heed proxy-uuid hintsResolvedTom Clegg03/29/2015Actions

Related issues

Related to Arvados - Feature #4717: [API] [SDKs] Read-only flag on keep_services table as hint to clients not to try to write.ResolvedRadhika Chippada05/12/2015Actions
Actions #1

Updated by Tom Clegg about 9 years ago

  • Category set to Keep
  • Assigned To set to Tom Clegg
Actions #2

Updated by Tom Clegg about 9 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Tom Clegg about 9 years ago

5414-keep-service-hints branch @ 9e91179:

  • Support +K@uuid in Python library
  • Support +K@uuid and +K@prefix in Go library
  • Some Go style fixes, and better logging in keepproxy (see commit message)
Actions #4

Updated by Peter Amstutz about 9 years ago

Comments on https://arvados.org/projects/arvados/wiki/Keep_S3_gateway

Does POST /collection use the API token of the client to create the collection?

If POST /collection just initiates a indexing process with S3, it should return 202 (Accepted) not 200. "The entity returned with this response SHOULD include an indication of the request's current status and either a pointer to a status monitor or some estimate of when the user can expect the request to be fulfilled."

Suggest additional API GET /collection/{uuid}/status, POST /collection/{uuid}/cancel@ for monitoring/control of indexing process. The "status" endpoint could be polled, or possibly return streaming status updates using chunked transfer encoding.

Suggest using an in-process database like SQLite to store the {hash, remote object} mapping instead of lots of tiny files.

Suggest GET /collections to get a list of collections managed by the gateway.

What's the use case for POST /manifest_text ?

I think it would be better to require that the client calling POST /collections provide the S3 credentials instead of having the server be configured with a single set of credentials. This way, clients can't ask the indexing server to index anything that the client doesn't already have access to.

Neutral on refactoring keepstore/keepproxy, except to mention that I did originally suggest that the keepproxy functionality be a module of keepstore instead of a separate server, but was rejected and it turns out that except for exposing similar REST APIs, there isn't very much overlap that isn't already in the Go SDK, and the actual HTTPServer handlers which do most of the work are completely different.

Actions #5

Updated by Radhika Chippada about 9 years ago

Review comments:

  • Mostly looks good and I was able to follow the updates towards the syntax and comment updates. I only have a few minor comments.
  • Just an observation. LocalRoots and GatewayRoots doing lock: I am understating this as, even though we are just returning a field from within kc, we want to prevent data changing due to another thread doing SetServiceRoots on the same object for example.
  • As long as we are trying to be standards complaint to the extent possible with comments and names etc, I think “pathPattern” would be better than “pathpattern”
  • keepclient_test.go
    • Rename “func (s *StandaloneSuite) TestGetWithServiceHint(c *C) ...” as “func (s *StandaloneSuite) TestGetWithServiceHintLocalUnused(c *C) ...” to make room for the following test?
    • Would it make sense to add a test with both local and gateway server configurations matching hash+"+K@"+uuid and verify that get happens from the gateway configuration? If I am understanding correctly, the gateway config takes higher precedence over the local config?
    • It does not seem like we have a test with keep proxy hint? (Please forgive me if I missed it in my reading the code). If not, would it make sense to add one?
  • Why did you say “Support +K@uuid in Python library” in note #3? It appears that you also added +K@prefix support in python?
  • Does it make sense to add additional tests as listed in keepclient_test.go item above for the python test suite also?
  • Peter’s comments above makes me wonder if the S3 interface design is considering writing collection objects into S3. I am assuming that this is not the case and we will continue to use our current object store (postgres) to store collections. Here is an article that talks about S3 and performance with small files: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance
Actions #6

Updated by Tom Clegg about 9 years ago

  • Target version changed from 2015-04-01 sprint to Arvados Future Sprints
Actions #7

Updated by Tom Clegg about 9 years ago

Radhika Chippada wrote:

Review comments:

  • Mostly looks good and I was able to follow the updates towards the syntax and comment updates. I only have a few minor comments.
  • Just an observation. LocalRoots and GatewayRoots doing lock: I am understating this as, even though we are just returning a field from within kc, we want to prevent data changing due to another thread doing SetServiceRoots on the same object for example.

Right, RLock() avoids race conditions if someone calls SetServiceRoots() while we are doing "get pointer". (The existence of the atomic.LoadPointer() suggests to me we shouldn't assume "get pointer" is atomic.)

  • As long as we are trying to be standards complaint to the extent possible with comments and names etc, I think “pathPattern” would be better than “pathpattern”

Agreed. Changed to locatorMatcher... and while I was at it noticed that whole Locator interface was a bit sketchy, and cleaned it up. (Now it returns an error if there was an error, instead of a Locator with empty fields, etc.) AFAICT the only purpose this whole thing actually serves anywhere is to get the size hint in keepproxy.go, but presumably the idea is to use it elsewhere eventually, so I figured it should be at least capable of handling all valid locators without losing info.

  • keepclient_test.go
    • Rename “func (s *StandaloneSuite) TestGetWithServiceHint(c *C) ...” as “func (s *StandaloneSuite) TestGetWithServiceHintLocalUnused(c *C) ...” to make room for the following test?
    • Would it make sense to add a test with both local and gateway server configurations matching hash+"+K@"+uuid and verify that get happens from the gateway configuration? If I am understanding correctly, the gateway config takes higher precedence over the local config?

The client should try the +K@{foo} services in the order specified (without regard to service type), then fall back to the usual local server probe order. I've added a test case where +K@{foo} points to a local service.

  • It does not seem like we have a test with keep proxy hint? (Please forgive me if I missed it in my reading the code). If not, would it make sense to add one?

That would be nice... I figured this would require mocking http requests, though, and I didn't quite get that far. (Do we have any examples of this sitting around somewhere?)

  • Why did you say “Support +K@uuid in Python library” in note #3? It appears that you also added +K@prefix support in python?

Python SDK already supported this, although I did rearrange it a bit while adding the +K@uuid support.

  • Does it make sense to add additional tests as listed in keepclient_test.go item above for the python test suite also?

test_get_with_gateway_hints_in_order() and test_get_with_remote_proxy_hint() are meant to cover various permutations; what do you have in mind?

  • Peter’s comments above makes me wonder if the S3 interface design is considering writing collection objects into S3. I am assuming that this is not the case and we will continue to use our current object store (postgres) to store collections. Here is an article that talks about S3 and performance with small files: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance

No, there are no plans to store collection records / manifests in S3. (We don't even store them in Keep now, except as scratch space during crunch jobs...)

Actions #8

Updated by Tom Clegg about 9 years ago

Now at 37b57a0

Actions #9

Updated by Tom Clegg about 9 years ago

  • Target version changed from Arvados Future Sprints to Bug Triage
Actions #10

Updated by Tom Clegg about 9 years ago

  • Target version changed from Bug Triage to 2015-04-29 sprint
Actions #11

Updated by Radhika Chippada about 9 years ago

Tom:
Just a few minor observation about the updates.

  • formatting is off for keepclient.rb (may be you need to rerun go fmt ...)
  • would it make sense to also add +K hint to at least one of the TestMakeLocator version in keepclient_test.go?
  • Just noticed this file name in git diff and wondering why “zz_load_config.rb” was called as such instead of something like “arv_load_config.rb”? It appears that it was load_config.rb initially … Same question about the other sleepy file
  • You asked about keepproxy test mocking: “Do we have any examples of this sitting around somewhere?”. I do not seem to find any such tests. We can revisit this at a later time if adding such test(s) is not feasible at this time.
Actions #12

Updated by Tom Clegg about 9 years ago

Radhika Chippada wrote:

Tom:
Just a few minor observation about the updates.

  • formatting is off for keepclient.rb (may be you need to rerun go fmt ...)

Indeed, several formatting things fixed with gofmt -s, thanks.

  • would it make sense to also add +K hint to at least one of the TestMakeLocator version in keepclient_test.go?

Yes. Added a test with a badly formatted +K and an entirely unrecognizable +U, to make sure it's forward-compatible with future kinds of hints.

  • Just noticed this file name in git diff and wondering why “zz_load_config.rb” was called as such instead of something like “arv_load_config.rb”? It appears that it was load_config.rb initially … Same question about the other sleepy file

zz_load_config has to run after secret_token.rb (if the latter exists) and zz_preload_models has to run after zz_load_config. Alphabetical naming seems to be the official Rails solution for controlling load order, but I've added some "require_relative" statements and that seems to do what we need, explicitly and without weird filenames.

  • You asked about keepproxy test mocking: “Do we have any examples of this sitting around somewhere?”. I do not seem to find any such tests. We can revisit this at a later time if adding such test(s) is not feasible at this time.

OK. This would be a good thing to do but I don't want it to hold up the "service hints" stuff. (I probably should have put the "+K@zzzzz" piece in its own branch in the first place, so "remote proxy" test issues could hold up the "remote proxy" feature without also holding up the "gateway" feature...)

Actions #13

Updated by Radhika Chippada about 9 years ago

Tom, LGTM. And thank you for explaining initializer loading and renaming.

Actions #14

Updated by Tom Clegg about 9 years ago

  • Target version changed from 2015-04-29 sprint to 2015-05-20 sprint
Actions #15

Updated by Tom Clegg about 9 years ago

  • Target version changed from 2015-05-20 sprint to 2015-06-10 sprint
Actions #16

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-06-10 sprint to Arvados Future Sprints
Actions #17

Updated by Tom Clegg about 7 years ago

  • Status changed from In Progress to Closed
Actions #18

Updated by Tom Morris about 5 years ago

  • Target version deleted (Arvados Future Sprints)
Actions

Also available in: Atom PDF