Story #13497

[API] Initial "arvados-controller" server that proxies API endpoints to Rails server

Added by Tom Clegg about 1 year ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 06/15/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5
Release:
Release relationship: Auto

Description

Background

This is the first step toward retiring the Rails API server (#9053). It unblocks:
  • implementing new APIs in Go (without increasing the discovery/routing burden in every client/SDK)
  • porting individual APIs to Go (without having to update all clients/SDKs each time, or proxy requests through Rails)
  • federation routing (#13493)
Related goal:
  • refactor existing Go services as packages so they can be used in unit tests

Objective

This initial version changes the way requests are routed inside an API server node.
  • Before: client → Nginx → arvados-api-server
  • After: client → Nginx → arvados-controller → arvados-api-server
This version does not add or change any API endpoints, port any existing API endpoints to Go, or implement load balancing or service discovery. For example:
  • Request and response headers are passed through blindly
  • All requests are proxied to one single arvados-api-server (Rails) service at the configured address and port (typically localhost:8000)

Requirements

Load configuration from the cluster configuration document from #12260. There will be no arvados-controller config file.


Subtasks

Task #13584: Review 13497-controller (Resolved, Tom Clegg)

Task #13654: Update docs (Resolved, Tom Clegg)

Task #13655: Review 13497-controller (Resolved, Tom Clegg)

Task #13938: Review 13497-controller (Resolved, Tom Clegg)


Related issues

Related to Arvados - Bug #14383: [API] Java SDK double slash bug with arvados-controller (Resolved, 02/13/2019)

Blocks Arvados - Story #9053: [Epic] Port APIs to Go (New)

Blocks Arvados - Feature #13493: Federated record retrieval (Resolved, 06/28/2018)

Blocks Arvados - Story #13574: [Controller] Update container priorities asynchronously (New)

Associated revisions

Revision fe561d69
Added by Tom Clegg about 1 year ago

13497: Merge branch 'master' into 13497-controller

refs #13497

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 73ad2ee9
Added by Tom Clegg about 1 year ago

13497: Merge branch 'master' into 13497-controller

refs #13497

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 41e15e62
Added by Tom Clegg about 1 year ago

Merge branch '13497-controller'

refs #13497

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision db5107dc
Added by Tom Clegg about 1 year ago

Merge branch '13497-controller'

refs #13497

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision ccfad8f8
Added by Tom Clegg 12 months ago

Merge branch '13497-controller'

refs #13497

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg about 1 year ago

#2 Updated by Tom Clegg about 1 year ago

#3 Updated by Nico César about 1 year ago

API in Go, what a good time to be alive.

My idea of the "after" scenario is the following
client → Nginx → arvados-router → Nginx → arvados-api-server

because [Nginx → arvados-api-server] should be treated as a single unit, given how we use Passenger today. The first Nginx is there because I like SSL to be handled by Nginx. This can be one instance of Nginx, and it is all configuration-based. I see this as an easy task for us to integrate. But...

Let's talk about why we need an arvados-router. We want to slowly shadow the legacy application's endpoints, meaning:

  • Each deploy will have brave-new rules that include routing to the new services in Go and deprecating old API calls
  • Initially this will be blank and progressively we'll be adding them. Progressively meaning each new version may or may not have new rules.
  • Throughput should be good for our current needs and our future needs while this migration is happening (could be years)
  • Debugging of bottlenecks and post-mortem logs has to be easy and fast
  • Bursty software development will require a lot of chained changes to the shadowing rules within a few days, and then priorities can change, leaving everything as-is for months, so the resulting architecture has to be stable.

This can be done in several ways:

  1. Have a file /etc/nginx/conf.d/arvados-router.conf that takes care of them in a "client → Nginx → Nginx → Nginx → arvados-api-server" configuration (I'm repeating Nginx here to match my diagram above. In a realistic scenario it's going to be one Nginx.)
  2. Use an HTTP router of some sort that is not built by us. I can think of Envoy here: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http_routing which is also a good technology for other platforms such as Kubernetes.
  3. Develop arvados-router from scratch, feeding it its configuration from /etc/arvados/router/config.yml, which is updated with every "state of the legacy's shadowing".

Everything here has a trade-off:

  1. The Nginx approach is easy to deploy initially, but we have to come up with a good way to update /etc/nginx/conf.d/arvados-router.conf in each version of the cluster bundle. It is very proven technology and we already have it in front of api-server. There is no need for an external TCP off-band-and-back connection, and there is almost no change in memory pressure.
  2. Envoy's feature list is good for any future expansion, but the complexity isn't minor.
  3. New development will require effort and time, and as above we have to come up with a good way to update /etc/arvados/router/config.yml in each version of the cluster bundle. The throughput it can handle is unknown (but most likely we'll be OK), and understanding implications and debugging will most likely require an Arvados engineer.

How will federation routing (#13493) impact this functionality?

My take is to go with option number 1 (Nginx properly configured), and we can talk about how /etc/nginx/conf.d/arvados-router.conf gets created (it will have a different format from /etc/arvados/router/config.yml but very similar content). A quick idea here is an nginx-arvados-router.deb that contains only that file, taken from the arvados repo, and reloads Nginx upon install.

Does that make sense?

#4 Updated by Nico César about 1 year ago

After a talk with Tom: this is a good idea, but the name "-router" is misleading. Maybe Tom will come up with a better name. ;)

#5 Updated by Nico César about 1 year ago

For lack of a better name I'll call it maestro for now (master/teacher in Spanish). Here are the ports to use:

client → [443/HTTPS]Nginx → [7900/HTTP]arvados-maestro → [8000/HTTP]Nginx → arvados-api-server

In the initial phase, arvados-maestro will be a pass-through daemon doing no work at all, just to have it in place and measure:

  • latency introduced.
  • memory usage
  • CPU / other resources used.

Initially, if this runs in the API VM, major network issues won't be a problem. As the microservice grows, we'll be moving it to a separate VM. I also want to see tests in a separate VM in the early stages, so we have early detection of problems, network latency and outages being the most notable, but also other environmental issues like NTP, DNS, etc.

This will have to be deployed in all clusters in this no-op mode, which will require adapting the current Nginx configurations via Puppet, plus creating the arvados-maestro package. I'll file some tickets about this when the time comes.

#6 Updated by Tom Morris about 1 year ago

  • Target version set to To Be Groomed

It seems like there must be a number of ready-built options that we could adopt here.

Do we have a list of candidates to evaluate?

#7 Updated by Tom Clegg about 1 year ago

  • Subject changed from [API] Initial "arvados-router" server that proxies API endpoints to Rails server to [API] Initial "arvados-controller" server that proxies API endpoints to Rails server
  • Description updated (diff)

Renamed from "router" to "controller".

This component is a replacement for the Rails API server.

#9 Updated by Tom Morris about 1 year ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 2.0

#10 Updated by Tom Clegg about 1 year ago

  • Blocks Story #13574: [Controller] Update container priorities asynchronously added

#11 Updated by Tom Morris about 1 year ago

  • Assigned To set to Tom Clegg
  • Target version changed from Arvados Future Sprints to 2018-06-20 Sprint

#12 Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress

#13 Updated by Tom Clegg about 1 year ago

13497-controller @ 21c5372c6b670820e842e01336eb6b191d6e10b7
  • new package "arvados-server" (currently only has "version" and "controller" subcommands)
  • new package "arvados-controller" (same binary as arvados-server, but comes with a systemd unit file, and installs the binary as /usr/bin/arvados-controller)
  • run-tests.sh routes integration tests' API traffic to controller (through Nginx+TLS) instead of Rails server
I'm thinking the transition can go something like this:
  1. Outline upgrade/install process -- see Installing controller service
  2. Review/merge this branch
  3. Document (here/wiki) how to update a site to use the arvados-controller service
  4. Refine docs with feedback from ops
  5. Confirm the service works on some real-life clusters
  6. Update the upgrade/install docs on doc.arvados.org accordingly
TODO:
  • refuse to start if Rails API port cannot be found in config (currently controller starts up but responds {"errors":["missing port in address"]})
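
The TODO above could be handled with a startup check along these lines. This is only a sketch: `validateUpstreamAddr` is a hypothetical name, and the real controller may read and validate its configuration quite differently. The "missing port in address" text in the error quoted above is what Go's `net.SplitHostPort` reports for an address with no port.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// validateUpstreamAddr (hypothetical helper) rejects a Rails API address
// that does not contain both a host and a non-empty port.
func validateUpstreamAddr(addr string) error {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("invalid Rails API address %q: %v", addr, err)
	}
	if port == "" {
		return fmt.Errorf("invalid Rails API address %q: empty port", addr)
	}
	return nil
}

func main() {
	// A real service would exit non-zero on the first failure; this demo
	// just reports the outcome for each example address.
	for _, addr := range []string{"localhost:8000", "localhost"} {
		if err := validateUpstreamAddr(addr); err != nil {
			fmt.Fprintln(os.Stderr, "refusing to start:", err)
			continue
		}
		fmt.Println("ok:", addr)
	}
}
```

Failing at startup surfaces the misconfiguration in the service logs once, instead of in every API response.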

#14 Updated by Tom Clegg about 1 year ago

13497-controller @ 23d57ba45b348b580fc584bbd77fe3960796622d
  • rename SystemNodes to NodeProfiles after discussion with Nico
  • add support for ARVADOS_NODE_PROFILE=x in /etc/arvados/environment as a way to select a profile without changing hostname or editing systemd files
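
A rough sketch of the selection rule implied above: an explicit ARVADOS_NODE_PROFILE value wins, otherwise the hostname is used. The function name `chooseProfileName` is hypothetical, and this assumes the variable from /etc/arvados/environment reaches the process environment (for example via the systemd unit); the branch itself may implement it differently.

```go
package main

import (
	"fmt"
	"os"
)

// chooseProfileName (hypothetical helper) prefers an explicit profile name
// from the environment and falls back to the machine's hostname.
func chooseProfileName(envValue, hostname string) string {
	if envValue != "" {
		return envValue
	}
	return hostname
}

func main() {
	hostname, err := os.Hostname()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot determine hostname:", err)
		hostname = ""
	}
	name := chooseProfileName(os.Getenv("ARVADOS_NODE_PROFILE"), hostname)
	fmt.Println("node profile:", name)
}
```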

#15 Updated by Lucas Di Pentima about 1 year ago

Although this is a fairly large update, I'm not finding any obvious issues so I don't want to block this merge much longer. LGTM, thanks!

#16 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2018-06-20 Sprint to 2018-07-03 Sprint

#17 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2018-07-03 Sprint to 2018-07-18 Sprint

#18 Updated by Tom Clegg about 1 year ago

  • Story points changed from 2.0 to 0.5

#19 Updated by Tom Clegg about 1 year ago

13497-controller @ 4369714821950366db98a54e4b62fdb5d09951a6
  • Fixes broken login/logout by propagating redirect responses back to client instead of following them.
  • Preserves original Host header in proxy requests (otherwise Rails uses its internal address like http://localhost:8000/ in redirect targets).

#21 Updated by Lucas Di Pentima about 1 year ago

There are some failing tests at: https://ci.curoverse.com/job/developer-run-tests/800/

I ran services/fuse tests locally without issues but sdk/python gave me errors about not finding "controller".

#22 Updated by Tom Clegg about 1 year ago

Turns out there are lots of places where scheme/vhost can get munged by proxies and not properly unmunged after they get used by the upstream server to construct redirect targets...

13497-controller @ f9a05f61abdf33891b09d62205d009d1cae73d1b https://ci.curoverse.com/job/developer-run-tests/807/

#24 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2018-07-18 Sprint to 2018-08-01 Sprint

#25 Updated by Tom Clegg about 1 year ago

Some haphazardly chosen timing data from 4xphq

| request id | API | controller timeTotal (s) | rails duration (ms) | delta (ms) |
| req-1fwee7391mf691vhppw7 | GET /arvados/v1/virtual_machines/get_all_logins | 0.082272 | 72.94 | 9.3 |
| req-yhp9nhblckp5zb3p5083 | GET /arvados/v1/jobs/queue | 0.180192 | 171.8 | 8.4 |
| req-11q47gm1taefs4azwaac | GET /arvados/v1/containers | 0.022545 | 6.99 | 15.6 |
| req-hmbw176kftg11mgsg0ex | POST /arvados/v1/collections | 1.242400 | 1233.46 | 8.9 |
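
The delta column in the timing data above appears to be the controller's total time converted to milliseconds minus the Rails duration, i.e. the overhead added by the extra hop. That reading is an inference from the numbers rather than something stated in the comment. A small Go sketch, with the hypothetical helper `overheadMs`, reproduces it:

```go
package main

import "fmt"

// overheadMs returns the controller's total time (given in seconds) minus
// the Rails duration (given in milliseconds), expressed in milliseconds.
func overheadMs(totalSeconds, railsMs float64) float64 {
	return totalSeconds*1000 - railsMs
}

func main() {
	rows := []struct {
		api          string
		totalSeconds float64
		railsMs      float64
	}{
		{"GET /arvados/v1/virtual_machines/get_all_logins", 0.082272, 72.94},
		{"GET /arvados/v1/jobs/queue", 0.180192, 171.8},
		{"GET /arvados/v1/containers", 0.022545, 6.99},
		{"POST /arvados/v1/collections", 1.242400, 1233.46},
	}
	for _, row := range rows {
		fmt.Printf("%-48s %.1f ms\n", row.api, overheadMs(row.totalSeconds, row.railsMs))
	}
}
```

For the first row this gives 82.272 ms - 72.94 ms ≈ 9.3 ms, matching the reported delta.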

#26 Updated by Tom Clegg 12 months ago

13497-controller @ c8a4dee5e52feed137ca3cb4c4a4e224efbb694f
  • adds "install controller" to install guide and upgrade notes

#28 Updated by Tom Clegg 12 months ago

  • Status changed from In Progress to Resolved

#29 Updated by Tom Clegg 9 months ago

  • Related to Bug #14383: [API] Java SDK double slash bug with arvados-controller added

#30 Updated by Tom Morris 8 months ago

  • Release set to 13
