Feature #11906

Basic authenticated http health check ("ping") for each system service

Added by Tom Clegg 5 months ago. Updated 4 months ago.

Status:ResolvedStart date:07/17/2017
Priority:NormalDue date:
Assignee:Radhika Chippada% Done:

100%

Category:Deployment
Target version:2017-08-02 sprint
Story points3.0Remaining (hours)0.00 hour
Velocity based estimate0 days

Description

Functional details:
  • respond to “GET /_health/ping” on same addr:port as the microservice’s main http server (or “management” server in nodemanager’s case)
  • expect header “Authorization: Bearer XXX” where XXX is “ManagementToken” in config file (note "Bearer", not "OAuth2")
  • return 404 if configured management token is blank / missing
  • return 401 if Authorization: header is missing
  • return 403 if Authorization: header does not match configured token
  • return JSON, either {"health":"OK"} or {"health":"ERROR"}

It’s OK if the “ping” healthcheck has lots of false-positive potential -- even if “OK” merely means the process is running. The focus here is instrumenting all services, not detecting all failure modes.


Subtasks

Task #11978: Review branch 11906-keepstore-pingResolvedRadhika Chippada

Task #11981: Review branch 11906-api-pingResolvedLucas Di Pentima

Task #11987: Review branch 11906-wb-pingResolvedLucas Di Pentima

Task #12000: Review 11906-health-check-libResolvedRadhika Chippada

Task #11996: Refactor health check handlers into an SDK libraryResolvedTom Clegg

Task #12021: Review branch 11906-keepproxy-pingResolvedLucas Di Pentima

Task #12043: Review branch 11906-arv-git-httpd-pingResolvedLucas Di Pentima

Task #12041: Review branch 11906-keep-web-pingResolvedRadhika Chippada

Task #12049: Review branch 11906-nodemanager-pingResolvedLucas Di Pentima


Related issues

Related to Arvados - Bug #11901: [arvados-ws] Fix leaking postgres connections and subsequ... Resolved 06/26/2017
Related to Arvados - Story #11748: Standard convention for Nagios-compatible health check sc... New
Related to Arvados - Feature #12260: Healthcheck endpoint aggregator Feedback 10/10/2017

Associated revisions

Revision 5b7be380
Added by Radhika Chippada 4 months ago

refs #11906

Merge branch '11906-keepstore-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

Revision e65fffb6
Added by Radhika Chippada 4 months ago

refs #11906

Merge branch '11906-api-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

Revision 818daa27
Added by Radhika Chippada 4 months ago

refs #11906

Merge branch '11906-api-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

Revision 3f7bde60
Added by Radhika Chippada 4 months ago

refs #11906

Merge branch '11906-wb-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

Revision 3574793a
Added by Tom Clegg 4 months ago

Merge branch '11906-health-check-lib'

refs #11906

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision f49e418e
Added by Radhika Chippada 4 months ago

refs #11906

Merge branch '11906-keepproxy-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

Revision 153ffc85
Added by Radhika Chippada 4 months ago

refs #11906

Merge branch '11906-keep-web-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

Revision 42c06689
Added by Radhika Chippada 4 months ago

refs #11906

Merge branch '11906-arv-git-httpd-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

Revision dfc4c2c7
Added by Radhika Chippada 4 months ago

closes #11906

Merge branch '11906-nodemanager-ping'

Arvados-DCO-1.1-Signed-off-by: Radhika Chippada <>

History

#2 Updated by Tom Clegg 5 months ago

including
  • apiserver
  • workbench
  • arv-git-httpd
  • nodemanager
  • keepstore
  • keepproxy
  • keep-web
  • arvados-ws (already done #11901)
  • keep-balance (has no http server)
  • crunch-dispatch-slurm (has no http server)
  • webshell/pam (has no http server)
  • arv-mount (has no http server)
  • login-sync (has no http server)

#3 Updated by Tom Morris 5 months ago

  • Story points set to 3.0

#4 Updated by Radhika Chippada 4 months ago

  • Status changed from New to In Progress
  • Target version set to 2017-07-19 sprint

#5 Updated by Lucas Di Pentima 4 months ago

Reviewing branch 11906-keepstore-ping - 499e303783a9b5430147ac4eef3bd13ff72968b9

Local test run services/keepstore was successful, just one question:

  • File services/keepstore/handlers.go
    • Lines 643-646: Shouldn’t the response be inside an “else” clause? Go docs say that the http.Error caller should ensure no further writes are made to ResponseWriter.

#6 Updated by Radhika Chippada 4 months ago

@ e6b3f6f6

Lines 643-646 of handlers.go: Shouldn’t the response be inside an “else” clause? Go docs say that the http.Error caller should ensure no further writes are made to ResponseWriter.

You are right; I should add a "return" statement to that if block. (As a side note, go fmt complains about "pyramid" code and prefers a return in the if block than an else block)

#7 Updated by Lucas Di Pentima 4 months ago

This LGTM, thanks!

#8 Updated by Radhika Chippada 4 months ago

11906-api-ping @ 943b3e807126ba5dc936ceeeddbee9801dab7308

  • Implemented /_health/ping for API server and added controller tests.
  • To test manually:
    • Add "management_token: abcdefg" to your application.yml file (and restart)
    • curl --insecure https://localhost:3030/_health/ping -H "Authorization: Bearer abcdefg" (assuming your api server is running at localhost:3030)

#9 Updated by Lucas Di Pentima 4 months ago

  • File services/api/app/controllers/arvados/v1/healthcheck_controller.rb
    • Lines 25-29: We could save one level of conditionals by getting the auth_header before line 21, and chain if/elsif/elsif/end, I think it would be a little more readable.
  • There’s a test failing: services/api/integration/errors_test.rb:26

#10 Updated by Radhika Chippada 4 months ago

File services/api/app/controllers/arvados/v1/healthcheck_controller.rb Lines 25-29

Updated

There’s a test failing: services/api/integration/errors_test.rb

Thanks for noticing this. I updated the test to add _health/* to routes

#11 Updated by Lucas Di Pentima 4 months ago

All services/api tests pass now, this LGTM. Thanks!

#12 Updated by Radhika Chippada 4 months ago

Branch 11906-wb-ping @ ffc117651144b197c9d650039a5403e972d54b7c

Adds _health/ping to workbench.

I also made small updates to API server implementation. The code merged in 11906-api-ping was returning responses of the format
{"errors":["authorization required"],"error_token":"1500405563+a2c23148"}. Updated to return {"errors":"authorization required"} instead.

Test run @ https://ci.curoverse.com/job/developer-run-tests/386/

#13 Updated by Lucas Di Pentima 4 months ago

Updates at ffc117651144b197c9d650039a5403e972d54b7c LGTM, just one tiny detail:

Line 20 at file apps/workbench/app/controllers/healthcheck_controller.rb has its indentation a bit off.

#14 Updated by Radhika Chippada 4 months ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint

#15 Updated by Radhika Chippada 4 months ago

  • Assignee set to Radhika Chippada

#16 Updated by Tom Clegg 4 months ago

#17 Updated by Radhika Chippada 4 months ago

Branch 11906-health-check-lib:

  • In sdk handler.go, can we can call “type Func” as “type HealthFunc” or something to help improve readability. I am thinking if we go back to this code after 6 months, it might be a bit confusing
  • The comment “Map of URI paths to health-check Func. The prefix is omitted: Routes["foo"] is the health check invoked by a request to “/_health/foo” => Here should it say a request to “{prefix}/foo” instead?
  • In the setup func, is it safe to assume that the prefix always starts with “/“ or do we want to ensure it similar to HasSuffix handling? (I just tested with “_health” as prefix and the test failed)
  • Do we want to add a test with wrong prefix as well in sdk handler_test.go? h.ServeHTTP(resp, s.request(“/nosuchprefix/ping", goodToken))
  • TestZeroValue => TestNotConfigured?

#18 Updated by Tom Clegg 4 months ago

Radhika Chippada wrote:

  • In sdk handler.go, can we can call “type Func” as “type HealthFunc” or something to help improve readability. I am thinking if we go back to this code after 6 months, it might be a bit confusing

https://blog.golang.org/package-names says "avoid stutter": like http.Server, this is really health.Func.

  • The comment “Map of URI paths to health-check Func. The prefix is omitted: Routes["foo"] is the health check invoked by a request to “/_health/foo” => Here should it say a request to “{prefix}/foo” instead?

Fixed.

  • In the setup func, is it safe to assume that the prefix always starts with “/“ or do we want to ensure it similar to HasSuffix handling? (I just tested with “_health” as prefix and the test failed)

I think the caller is already responsible for building an http.Server and must know what paths look like in that context. I'm more inclined to remove the special suffix-adding behavior if anything. The example makes it clear what callers should do, and in the long run auto-fixing things just means committing to a more subtle API. Might as well just force all callers to call the same way right from the start.

  • Do we want to add a test with wrong prefix as well in sdk handler_test.go? h.ServeHTTP(resp, s.request(“/nosuchprefix/ping", goodToken))

Added.

  • TestZeroValue => TestNotConfigured?

TestZeroValueIsDisabled. And added another check.

11906-health-check-lib @ da037c1372879cb0ebc221c42ebdb86af669295b

#19 Updated by Radhika Chippada 4 months ago

11906-health-check-lib LGTM. Thanks.

#20 Updated by Lucas Di Pentima 4 months ago

Updates at 91f976b99d4e7f00d2c1fbfee75812c3b1b780c6 - branch 11906-keepproxy-ping LGTM, thanks!

#21 Updated by Radhika Chippada 4 months ago

Branch 11906-keep-web-ping @ 1af8267a0d11ad9f986aec5ec7853299f3cfeb0e implements /_health/ping for keep-web.

Test run @ https://ci.curoverse.com/job/developer-run-tests/398/

#22 Updated by Tom Clegg 4 months ago

I pushed a commit removing the func and servemux, and calling the health handler's ServeHTTP() directly instead.

The rest LGTM, if this looks OK to you.

11906-keep-web-ping @ 2d16a410f0121806bf847f9e6290dae66e96fc5f

#23 Updated by Lucas Di Pentima 4 months ago

Updates at dc929c778 - branch 11906-arv-git-httpd-ping LGTM, please merge.

#24 Updated by Lucas Di Pentima 4 months ago

Reviewed updates at 8ae04f9f4d252a99985d7ac047413318d2f1068b for nodemanager. Looks good to me, thanks!

#25 Updated by Radhika Chippada 4 months ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:dfc4c2c74dc7a31ff0c0d307846b7e9525fdcf0e.

Also available in: Atom PDF