Project

General

Profile

Actions

Feature #12260

closed

Healthcheck endpoint aggregator

Added by Tom Morris over 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
10/10/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description


Subtasks 1 (0 open1 closed)

Task #12362: Review 12260-system-healthResolvedLucas Di Pentima10/10/2017

Actions

Related issues

Related to Arvados - Feature #11906: Basic authenticated http health check ("ping") for each system serviceResolvedRadhika Chippada07/17/2017

Actions
Related to Arvados - Story #11748: Standard convention for Nagios-compatible health check scripts kept alongside source for servicesRejected

Actions
Actions #1

Updated by Tom Clegg over 5 years ago

Authenticated externally-visible endpoint that reports all of the health checks for all services that should be running
listen on configured host:port (default :9201)
  • respond to "GET /_health/all"
  • require admin token in "Authorization: Bearer XXX" header (return 500 if token can't be checked because API is down)
  • concurrently call /_health/ping on all services that should be up
  • when all ping responses have arrived/failed, respond to client with {"health":"OK"} if all succeeded, otherwise {"health":"ERROR"}
  • also include detail of each check, using key "{node}/{component}/{health-check-path}":
    • { "health":"OK",
        "checks":{
          "keep/keep-web/_health/ping":{"health":"OK","responseTime":0.00123,"status":200},
          "keep0/keepstore/_health/ping":{"health":"OK","responseTime":0.00123,"status":200},
          ...
        }
      }
      
  • (?) allow for multiple instances of a service (e.g., keepstore) on a single node
  • report missing services as failed
    • in general, the system is healthy if there are enough healthy instances of each component/service
    • for now, call 1 "enough" for all services
configuration:
  • /etc/arvados/config.yml is a cluster configuration
    • minimum for this story: cluster id (uuid prefix), port numbers (if different from install guide), map of hostname → services ("SystemNodes")
  • type SiteConfig struct in sdk/go/arvados
  • example /etc/arvados/config.yml:
    Clusters:
      qr1hi:
        ManagementToken: s3cr3t
        SystemNodes:
          keep0:
            Keepstore:
              Listen: :25107
          keep1:
            Keepstore:
              Listen: :25107
    
implementation:
  • package health (lib/health)
  • health.Aggregator implements http.Handler
  • health.Aggregator has an arvados.SiteConfig field
  • package main (services/health → arvados-health_*.deb → /usr/bin/arvados-health)
  • main starts http server at (SiteConfig)CurrentSystemNode().Health.Port and attaches a health.Aggregator{SiteConfig: arvados.LoadSiteConfig(nil)}
Actions #2

Updated by Tom Clegg over 5 years ago

  • Subject changed from Healthcheck / status endpoint aggregator to Healthcheck endpoint aggregator
Actions #3

Updated by Tom Morris over 5 years ago

  • Target version changed from Arvados Future Sprints to 2017-10-11 Sprint
  • Story points set to 2.0
Actions #4

Updated by Tom Clegg over 5 years ago

  • Assigned To set to Tom Clegg
Actions #5

Updated by Tom Clegg over 5 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Tom Clegg over 5 years ago

12260-system-health @ a9497f8d2756104ba07d88d5c8c7b84790fd83f3

Known todos:
  • Update package scripts to build deb/rpm packages ("arvados-health")
  • Update install guide (or maybe wait until we have seen it work in real life?)
Actions #8

Updated by Tom Clegg over 5 years ago

  • Target version changed from 2017-10-11 Sprint to 2017-10-25 Sprint
Actions #9

Updated by Tom Clegg over 5 years ago

  • Story points changed from 2.0 to 1.0
Actions #10

Updated by Lucas Di Pentima over 5 years ago

Some comments/questions:

  • File sdk/go/arvados/config.go
    • Lines 53 & 63: Comments seem to be outdated naming funcs & types that aren’t named like that
    • Line 60: The fact that GetSystemNode() returns (*SystemNode, error) is enough to avoid explicitly returning 'error' there?
  • File sdk/go/health/aggregator.go
    • Lines 143 & 146: Is the re-assignment necessary? (I ask because I remember doing some similar test and it seemed to be assigning by reference, not copy)
Actions #11

Updated by Tom Clegg over 5 years ago

Lucas Di Pentima wrote:

Some comments/questions:

  • File sdk/go/arvados/config.go
    • Lines 53 & 63: Comments seem to be outdated naming funcs & types that aren’t named like that

Fixed

  • Line 60: The fact that GetSystemNode() returns (*SystemNode, error) is enough to avoid explicitly returning 'error' there?

Yes. In fact, if GetSystemNode(x) only returned a single value, "return GetSystemNode(x)" wouldn't compile ("not enough values to return").

(You can use the same shortcut with function args: if you have a func foo(*SystemNode, error) you can call foo(GetSystemNode(x)).

  • File sdk/go/health/aggregator.go
    • Lines 143 & 146: Is the re-assignment necessary? (I ask because I remember doing some similar test and it seemed to be assigning by reference, not copy)

Yes. You can read or write from a map, but there's no "update" syntax. If resp.Services were a map[string]*ServiceHealth then you could read a pointer and update the object it points to: "resp.Services[svc].N++". But it's a map[string]ServiceHealth, so we have to do read-update-write explicitly.

12260-system-health @ ff100fbf824e2dbc2ff0afd3d746ac562532cfb6

Actions #12

Updated by Lucas Di Pentima over 5 years ago

LGTM. Thanks.

Actions #13

Updated by Tom Clegg over 5 years ago

  • Status changed from In Progress to Feedback
Actions #14

Updated by Tom Morris over 5 years ago

  • Target version changed from 2017-10-25 Sprint to 2017-11-08 Sprint
Actions #15

Updated by Tom Clegg about 5 years ago

  • Target version changed from 2017-11-08 Sprint to 2017-11-22 Sprint
  • Story points changed from 1.0 to 0.0
Actions #16

Updated by Tom Clegg about 5 years ago

  • Status changed from Feedback to Resolved
Actions #17

Updated by Tom Clegg almost 5 years ago

  • Description updated (diff)
  • Story points deleted (0.0)
Actions

Also available in: Atom PDF