Feature #12260

Healthcheck endpoint aggregator

Added by Tom Morris about 1 month ago. Updated 3 days ago.

Status:In ProgressStart date:10/10/2017
Priority:NormalDue date:
Assignee:Tom Clegg% Done:

0%

Category:-
Target version:2017-10-25 Sprint
Story points1.0Remaining (hours)0.00 hour
Velocity based estimate0 days

Subtasks

Task #12362: Review 12260-system-healthIn ProgressLucas Di Pentima


Related issues

Related to Arvados - Feature #11906: Basic authenticated http health check ("ping") for each s... Resolved 07/17/2017
Related to Arvados - Story #11748: Standard convention for Nagios-compatible health check sc... New

Associated revisions

Revision e14975a4
Added by Tom Clegg 2 days ago

Merge branch '12260-system-health'

refs #12260

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 2e087bc0
Added by Peter Amstutz 1 day ago

Comment out sdk/go/health from run-tests.sh until it is fixed.

refs #12260

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Tom Clegg about 1 month ago

Authenticated externally-visible endpoint that reports all of the health checks for all services that should be running
listen on configured host:port (default :9201)
  • respond to "GET /_health/all"
  • require admin token in "Authorization: Bearer XXX" header (return 500 if token can't be checked because API is down)
  • concurrently call /_health/ping on all services that should be up
  • when all ping responses have arrived/failed, respond to client with {"health":"OK"} if all succeeded, otherwise {"health":"ERROR"}
  • also include detail of each check, using key "{node}/{component}/{health-check-path}":
    • { "health":"OK",
        "checks":{
          "keep/keep-web/_health/ping":{"health":"OK","responseTime":0.00123,"status":200},
          "keep0/keepstore/_health/ping":{"health":"OK","responseTime":0.00123,"status":200},
          ...
        }
      }
      
  • (?) allow for multiple instances of a service (e.g., keepstore) on a single node
  • report missing services as failed
    • in general, the system is healthy if there are enough healthy instances of each component/service
    • for now, call 1 "enough" for all services
configuration:
  • /etc/arvados/config.yml is a cluster configuration
    • minimum for this story: cluster id (uuid prefix), port numbers (if different from install guide), map of hostname → services ("SystemNodes")
  • type SiteConfig struct in sdk/go/arvados
  • example /etc/arvados.yml:
    Clusters:
      qr1hi:
        ManagementToken: s3cr3t
        SystemNodes:
          keep0:
            Keepstore:
              Listen: :25107
          keep1:
            Keepstore:
              Listen: :25107
    
implementation:
  • package health (lib/health)
  • health.Aggregator implements http.Handler
  • health.Aggregator has an arvados.SiteConfig field
  • package main (services/health → arvados-health_*.deb → /usr/bin/arvados-health)
  • main starts http server at (SiteConfig)CurrentSystemNode().Health.Port and attaches a health.Aggregator{SiteConfig: arvados.LoadSiteConfig(nil)}

#2 Updated by Tom Clegg about 1 month ago

  • Subject changed from Healthcheck / status endpoint aggregator to Healthcheck endpoint aggregator

#3 Updated by Tom Morris about 1 month ago

  • Target version changed from Arvados Future Sprints to 2017-10-11 Sprint
  • Story points set to 2.0

#4 Updated by Tom Clegg 22 days ago

  • Assignee set to Tom Clegg

#5 Updated by Tom Clegg 21 days ago

  • Status changed from New to In Progress

#7 Updated by Tom Clegg 9 days ago

12260-system-health @ a9497f8d2756104ba07d88d5c8c7b84790fd83f3

Known todos:
  • Update package scripts to build deb/rpm packages ("arvados-health")
  • Update install guide (or maybe wait until we have seen it work in real life?)

#8 Updated by Tom Clegg 8 days ago

  • Target version changed from 2017-10-11 Sprint to 2017-10-25 Sprint

#9 Updated by Tom Clegg 8 days ago

  • Story points changed from 2.0 to 1.0

#10 Updated by Lucas Di Pentima 7 days ago

Some comments/questions:

  • File sdk/go/arvados/config.go
    • Lines 53 & 63: Comments seem to be outdated naming funcs & types that aren’t named like that
    • Line 60: The fact that GetSystemNode() returns (*SystemNode, error) is enough to avoid explicitly returning 'error' there?
  • File sdk/go/health/aggregator.go
    • Lines 143 & 146: Is the re-assignment necessary? (I ask because I remember doing some similar test and it seemed to be assigning by reference, not copy)

#11 Updated by Tom Clegg 3 days ago

Lucas Di Pentima wrote:

Some comments/questions:

  • File sdk/go/arvados/config.go
    • Lines 53 & 63: Comments seem to be outdated naming funcs & types that aren’t named like that

Fixed

  • Line 60: The fact that GetSystemNode() returns (*SystemNode, error) is enough to avoid explicitly returning 'error' there?

Yes. In fact, if GetSystemNode(x) only returned a single value, "return GetSystemNode(x)" wouldn't compile ("not enough values to return").

(You can use the same shortcut with function args: if you have a func foo(*SystemNode, error) you can call foo(GetSystemNode(x)).

  • File sdk/go/health/aggregator.go
    • Lines 143 & 146: Is the re-assignment necessary? (I ask because I remember doing some similar test and it seemed to be assigning by reference, not copy)

Yes. You can read or write from a map, but there's no "update" syntax. If resp.Services were a map[string]*ServiceHealth then you could read a pointer and update the object it points to: "resp.Services[svc].N++". But it's a map[string]ServiceHealth, so we have to do read-update-write explicitly.

12260-system-health @ ff100fbf824e2dbc2ff0afd3d746ac562532cfb6

#12 Updated by Lucas Di Pentima 3 days ago

LGTM. Thanks.

Also available in: Atom PDF