Keep proxy service: forward external requests to the appropriate keep servers
Useful for external and federated clients.Necessary:
- Use API host and token provided in environment variables to get the list of Keep servers at startup.
- Use node selection algorithm (see Python SDK) to select probe order when reading and writing.
Authorizationheader to Keep servers.
- use 2x replication, i.e., write to 2 Keep servers (future: client specifies this in a header)
- in response headers, say "X-Data-Replication: 2"
- 405 if
--no-GETflag was given.
- For now: do not implement GET. Do not even start up unless
--no-GETflag is given.
- 405 if
- GET /index*:
- PUT: verify Authorization header by requesting /users/current from apiserver, respond 401 if not valid.
- Periodically request the latest list of Keep servers. (This has an easy workaround: TERM the server as needed.)
- 405 if
--no-HEADflag was given, otherwise forward to Keep servers in probe order until it succeeds, send first successful response (or last failure, if none) to client
- 405 if
- Probably best to do this with Go. Perhaps something like https://github.com/elazarl/goproxy or even plain net/http. (But definitely as a separate project, not by adding "proxy mode" to the Keep server itself.)
Merge branch '1885-keep-proxy' of git.curoverse.com:arvados into 1885-keep-proxy refs #1885
- Please enter a commit message to explain why this merge is necessary,
- especially if it merges an updated upstream into a topic branch. #
- Lines starting with '#' will be ignored, and an empty message aborts
- the commit.
#11 Updated by Ward Vandewege over 7 years ago
go.sh: you need to force GOPATH to be absolute, for example like this:
rootdir=$(dirname $(readlink -f $0))
otherwise, go blows up like this if you run go.sh from a symlinked path:
go: GOPATH entry is relative; must be absolute path: ".".
Run 'go help gopath' for usage.
go.sh: needs documentation for arguments required. Is go.sh build keep what you are intending it to be run as?
go.sh: does not seem to replicate build.sh's functionality - it does not put the keep binary in bin/. Or maybe I'm not calling it the right way, cf. the previous point.
services/keep/src/arvados.org/keepproxy/keepproxy.go: has some commented out code that should be removed
services/keep/src/arvados.org/keepproxy/keepproxy.go: don't hardcode https on line 165, that should be configurable (but default to https) or ideally auto-discovered
services/keep/src/arvados.org/keepproxy/keepproxy_test.go: same, don't hardcode https on line 74
services/keep/src/arvados.org/keepproxy/keepproxy_test.go: these two commands need to check for failure and print the command output if run_test_server.py fails:
exec.Command("python", "run_test_server.py", "start").Run()
exec.Command("python", "run_test_server.py", "start_keep").Run()
FYI - I'm planning to restrict the number of concurrent connections to the Keep Proxy at the nginx level.
Any thoughts on expected memory use patterns of keep proxy? I assume the worst case scenario (in put mode) would be many clients all writing full 64MiB blocks, slowly. That scenario is what the limit on concurrent connections will protect against.
Please write a log line for every GET/HEAD/PUT request to STDOUT. Something simple will do for now. In Keep we use timestamp method /hash at the moment, and that's plenty fine.
Other than that, I'd say this looks pretty good! Tests pass.
#12 Updated by Peter Amstutz over 7 years ago
- Added comment to go.sh. Go.sh is just a wrapper around the go toolchain so regular go commands apply.
- "go install arvados.org/keepproxy" should build keepproxy and copy it to bin/
- The commented out code was to force the user to disable GET, which in Tom's original writeup was not supposed to be implemented (but I did it anyway). I'm not clear on the status of keep permissions, in theory the proxy supports it but I haven't tested it.
- Discussed https when calling API server on IRC
- Checks error return on .Run() now in the tests
- Regarding memory usage:
- GET requests should stream through with minimal buffering. It effectively goes directly from the http.Response reader from the Keep server directly into the http.ResponseWriter that sends to the client, with only a small stub that incrementally tallys the md5 checksum. The buffer used by the Go http library is something like 32 KiB
- PUT requests pre-allocate a buffer based on the Content-Length in the PUT header, and that buffer is held until the block has been sufficiently replicated or it gives up after contacting all Keep servers. That means we probably do want a relatively conservative cap on the number of concurrent PUT requests allowed, e.g. 16 * 64 MiB will require 1 GiB of RAM.
- Added a bunch of logging, and uncovered a rather serious and stupid bug in not correctly failing on authorization errors :-)
#13 Updated by Ward Vandewege over 7 years ago
This looks good to merge to me; just one change please: also log requests that are 404 or fail in other ways. It can just be the same log line as for a successful request for now, unless you want to upgrade it a little bit and add the HTTP response code in the log line, that would be nice.