Project

General

Profile

Actions

Feature #11906

closed

Basic authenticated http health check ("ping") for each system service

Added by Tom Clegg almost 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
Deployment
Target version:
Story points:
3.0

Description

Functional details:
  • respond to “GET /_health/ping” on same addr:port as the microservice’s main http server (or “management” server in nodemanager’s case)
  • expect header “Authorization: Bearer XXX” where XXX is “ManagementToken” in config file (note "Bearer", not "OAuth2")
  • return 404 if configured management token is blank / missing
  • return 401 if Authorization: header is missing
  • return 403 if Authorization: header does not match configured token
  • return JSON, either {"health":"OK"} or {"health":"ERROR"}

It’s OK if the “ping” healthcheck has lots of false-positive potential -- even if “OK” merely means the process is running. The focus here is instrumenting all services, not detecting all failure modes.


Subtasks 9 (0 open9 closed)

Task #11978: Review branch 11906-keepstore-pingResolvedRadhika Chippada07/17/2017Actions
Task #11981: Review branch 11906-api-pingResolvedLucas Di Pentima07/18/2017Actions
Task #11987: Review branch 11906-wb-pingResolvedLucas Di Pentima07/18/2017Actions
Task #12000: Review 11906-health-check-libResolvedRadhika Chippada07/17/2017Actions
Task #11996: Refactor health check handlers into an SDK libraryResolvedTom Clegg07/17/2017Actions
Task #12021: Review branch 11906-keepproxy-pingResolvedLucas Di Pentima07/24/2017Actions
Task #12043: Review branch 11906-arv-git-httpd-pingResolvedLucas Di Pentima07/26/2017Actions
Task #12041: Review branch 11906-keep-web-pingResolvedRadhika Chippada07/26/2017Actions
Task #12049: Review branch 11906-nodemanager-pingResolvedLucas Di Pentima07/27/2017Actions

Related issues

Related to Arvados - Bug #11901: [arvados-ws] Fix leaking postgres connections and subsequent stallResolvedTom Clegg06/26/2017Actions
Related to Arvados - Idea #11748: Standard convention for Nagios-compatible health check scripts kept alongside source for servicesRejectedActions
Related to Arvados - Feature #12260: Healthcheck endpoint aggregatorResolvedTom Clegg10/10/2017Actions
Actions #2

Updated by Tom Clegg almost 7 years ago

including
  • apiserver
  • workbench
  • arv-git-httpd
  • nodemanager
  • keepstore
  • keepproxy
  • keep-web
  • arvados-ws (already done #11901)
  • keep-balance (has no http server)
  • crunch-dispatch-slurm (has no http server)
  • webshell/pam (has no http server)
  • arv-mount (has no http server)
  • login-sync (has no http server)
Actions #3

Updated by Tom Morris almost 7 years ago

  • Story points set to 3.0
Actions #4

Updated by Radhika Chippada almost 7 years ago

  • Status changed from New to In Progress
  • Target version set to 2017-07-19 sprint
Actions #5

Updated by Lucas Di Pentima almost 7 years ago

Reviewing branch 11906-keepstore-ping - 499e303783a9b5430147ac4eef3bd13ff72968b9

Local test run services/keepstore was successful, just one question:

  • File services/keepstore/handlers.go
    • Lines 643-646: Shouldn’t the response be inside an “else” clause? Go docs say that the http.Error caller should ensure no further writes are made to ResponseWriter.
Actions #6

Updated by Radhika Chippada almost 7 years ago

@ e6b3f6f6

Lines 643-646 of handlers.go: Shouldn’t the response be inside an “else” clause? Go docs say that the http.Error caller should ensure no further writes are made to ResponseWriter.

You are right; I should add a "return" statement to that if block. (As a side note, go fmt complains about "pyramid" code and prefers a return in the if block than an else block)

Actions #7

Updated by Lucas Di Pentima almost 7 years ago

This LGTM, thanks!

Actions #8

Updated by Radhika Chippada almost 7 years ago

11906-api-ping @ 943b3e807126ba5dc936ceeeddbee9801dab7308

  • Implemented /_health/ping for API server and added controller tests.
  • To test manually:
    • Add "management_token: abcdefg" to your application.yml file (and restart)
    • curl --insecure https://localhost:3030/_health/ping -H "Authorization: Bearer abcdefg" (assuming your api server is running at localhost:3030)
Actions #9

Updated by Lucas Di Pentima almost 7 years ago

  • File services/api/app/controllers/arvados/v1/healthcheck_controller.rb
    • Lines 25-29: We could save one level of conditionals by getting the auth_header before line 21, and chain if/elsif/elsif/end, I think it would be a little more readable.
  • There’s a test failing: services/api/integration/errors_test.rb:26
Actions #10

Updated by Radhika Chippada almost 7 years ago

File services/api/app/controllers/arvados/v1/healthcheck_controller.rb Lines 25-29

Updated

There’s a test failing: services/api/integration/errors_test.rb

Thanks for noticing this. I updated the test to add _health/* to routes

Actions #11

Updated by Lucas Di Pentima almost 7 years ago

All services/api tests pass now, this LGTM. Thanks!

Actions #12

Updated by Radhika Chippada almost 7 years ago

Branch 11906-wb-ping @ ffc117651144b197c9d650039a5403e972d54b7c

Adds _health/ping to workbench.

I also made small updates to API server implementation. The code merged in 11906-api-ping was returning responses of the format
{"errors":["authorization required"],"error_token":"1500405563+a2c23148"}. Updated to return {"errors":"authorization required"} instead.

Test run @ https://ci.curoverse.com/job/developer-run-tests/386/

Actions #13

Updated by Lucas Di Pentima over 6 years ago

Updates at ffc117651144b197c9d650039a5403e972d54b7c LGTM, just one tiny detail:

Line 20 at file apps/workbench/app/controllers/healthcheck_controller.rb has its indentation a bit off.

Actions #14

Updated by Radhika Chippada over 6 years ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
Actions #15

Updated by Radhika Chippada over 6 years ago

  • Assigned To set to Radhika Chippada
Actions #16

Updated by Tom Clegg over 6 years ago

Actions #17

Updated by Radhika Chippada over 6 years ago

Branch 11906-health-check-lib:

  • In sdk handler.go, can we can call “type Func” as “type HealthFunc” or something to help improve readability. I am thinking if we go back to this code after 6 months, it might be a bit confusing
  • The comment “Map of URI paths to health-check Func. The prefix is omitted: Routes["foo"] is the health check invoked by a request to “/_health/foo” => Here should it say a request to “{prefix}/foo” instead?
  • In the setup func, is it safe to assume that the prefix always starts with “/“ or do we want to ensure it similar to HasSuffix handling? (I just tested with “_health” as prefix and the test failed)
  • Do we want to add a test with wrong prefix as well in sdk handler_test.go? h.ServeHTTP(resp, s.request(“/nosuchprefix/ping", goodToken))
  • TestZeroValue => TestNotConfigured?
Actions #18

Updated by Tom Clegg over 6 years ago

Radhika Chippada wrote:

  • In sdk handler.go, can we can call “type Func” as “type HealthFunc” or something to help improve readability. I am thinking if we go back to this code after 6 months, it might be a bit confusing

https://blog.golang.org/package-names says "avoid stutter": like http.Server, this is really health.Func.

  • The comment “Map of URI paths to health-check Func. The prefix is omitted: Routes["foo"] is the health check invoked by a request to “/_health/foo” => Here should it say a request to “{prefix}/foo” instead?

Fixed.

  • In the setup func, is it safe to assume that the prefix always starts with “/“ or do we want to ensure it similar to HasSuffix handling? (I just tested with “_health” as prefix and the test failed)

I think the caller is already responsible for building an http.Server and must know what paths look like in that context. I'm more inclined to remove the special suffix-adding behavior if anything. The example makes it clear what callers should do, and in the long run auto-fixing things just means committing to a more subtle API. Might as well just force all callers to call the same way right from the start.

  • Do we want to add a test with wrong prefix as well in sdk handler_test.go? h.ServeHTTP(resp, s.request(“/nosuchprefix/ping", goodToken))

Added.

  • TestZeroValue => TestNotConfigured?

TestZeroValueIsDisabled. And added another check.

11906-health-check-lib @ da037c1372879cb0ebc221c42ebdb86af669295b

Actions #19

Updated by Radhika Chippada over 6 years ago

11906-health-check-lib LGTM. Thanks.

Actions #20

Updated by Lucas Di Pentima over 6 years ago

Updates at 91f976b99d4e7f00d2c1fbfee75812c3b1b780c6 - branch 11906-keepproxy-ping LGTM, thanks!

Actions #21

Updated by Radhika Chippada over 6 years ago

Branch 11906-keep-web-ping @ 1af8267a0d11ad9f986aec5ec7853299f3cfeb0e implements /_health/ping for keep-web.

Test run @ https://ci.curoverse.com/job/developer-run-tests/398/

Actions #22

Updated by Tom Clegg over 6 years ago

I pushed a commit removing the func and servemux, and calling the health handler's ServeHTTP() directly instead.

The rest LGTM, if this looks OK to you.

11906-keep-web-ping @ 2d16a410f0121806bf847f9e6290dae66e96fc5f

Actions #23

Updated by Lucas Di Pentima over 6 years ago

Updates at dc929c778 - branch 11906-arv-git-httpd-ping LGTM, please merge.

Actions #24

Updated by Lucas Di Pentima over 6 years ago

Reviewed updates at 8ae04f9f4d252a99985d7ac047413318d2f1068b for nodemanager. Looks good to me, thanks!

Actions #25

Updated by Radhika Chippada over 6 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:dfc4c2c74dc7a31ff0c0d307846b7e9525fdcf0e.

Actions

Also available in: Atom PDF