Auto-discovery » History » Version 1

Ward Vandewege, 10/18/2021 07:31 PM

h1. Auto-discovery

See #18256

h2. Goals

* remove the need for the config file to be present on every node
* auto-discovery of Arvados services

h2. Status quo:

* every node that runs a permanent Arvados service needs (a subset of) the config.yml file present on the filesystem
* for simplicity, we recommend that the entire config file be present on each such machine
* the non-private parts of the config are exposed on controller via an API

h2. Problems with the status quo:

* it can be difficult to keep the config file in sync across nodes (doing so requires configuration management)
* it is not ideal from a security perspective that the entire config file (with all its secrets) is installed on each node, even when only a subset is required. This is particularly the case when the node is not 100% trusted, e.g. a compute node image with local keepstore (cf. #16347) when Docker is the runner.

h2. Desired functionality:

* Arvados services can request configuration values from controller (by key)
* for the secret parts of the config, some form of authentication is required
* Arvados services can register themselves with controller. E.g. when arvados-ws starts up, it would discover the controller (see below), make itself known as an arvados-ws service, and request the configuration keys it needs.
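
The first two points above could look roughly like the following on the controller side. This is only a sketch under stated assumptions: the config contents, the @SECRET_KEYS@ set, and the function names @lookup@ and @get_config_value@ are all hypothetical, and how a service would actually authenticate is left open by this design.

```python
import hmac

# Hypothetical example config; keys and values are illustrative only.
CONFIG = {
    "Services": {"Websocket": {"ExternalURL": "wss://ws.example.com/"}},
    "SystemRootToken": "s3cret-root-token",
}

# Hypothetical set of dotted keys that are treated as secret.
SECRET_KEYS = {"SystemRootToken"}


def lookup(config, dotted_key):
    """Walk a nested dict using a dotted key such as 'Services.Websocket'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_key)
        node = node[part]
    return node


def get_config_value(dotted_key, service_secret=None, expected_secret=None):
    """Return a config value; secret keys require a matching service secret."""
    # A key is secret if it is a secret key, lies below one, or contains one.
    is_secret = any(
        dotted_key == s
        or dotted_key.startswith(s + ".")
        or s.startswith(dotted_key + ".")
        for s in SECRET_KEYS
    )
    if is_secret:
        if not service_secret or not expected_secret:
            raise PermissionError("authentication required for " + dotted_key)
        # Constant-time comparison avoids leaking the secret through timing.
        if not hmac.compare_digest(
            service_secret.encode(), expected_secret.encode()
        ):
            raise PermissionError("invalid service secret")
    return lookup(CONFIG, dotted_key)
```

Public keys are served without credentials, matching what controller already exposes today; anything marked secret is refused unless the caller presents the expected secret.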

h2. Discovery process:

We could adopt something similar to how Puppet et al. do discovery: a well-known DNS name that is polled until it is reachable, after which the necessary config is retrieved and cached locally. E.g. each service could try to reach "arvados" on port 443, though we'd probably want to do that with ARVADOS_API_HOST_INSECURE set. That is unattractive, because it turns off TLS certificate verification.

So how about this: the client makes an HTTP "join" request to "arvados" on port 80, which responds with a 301 redirect to controller's FQDN over HTTPS? That way we get transport-level security and automatic discovery. The payload is a JSON object with these fields: FQDN, local IP address (which one to use is an open question), service name (e.g. arvados-ws), and optionally a pre-shared key (PSK). The client repeats the join request every few seconds until it is accepted or rejected.
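
A client-side sketch of the join request and the polling loop might look like this. The JSON field names, the status strings, and the @send@ callable (which stands in for the actual HTTP request and redirect handling) are assumptions made for illustration; none of them are defined by this proposal.

```python
import json
import socket
import time


def build_join_payload(service_name, psk=None, fqdn=None, local_ip=None):
    """Build the JSON body of a join request (field names are assumptions)."""
    if fqdn is None:
        fqdn = socket.getfqdn()
    if local_ip is None:
        try:
            local_ip = socket.gethostbyname(socket.gethostname())
        except OSError:
            local_ip = ""  # which address to report is still an open question
    payload = {"fqdn": fqdn, "local_ip": local_ip, "service_name": service_name}
    if psk is not None:
        payload["psk"] = psk
    return json.dumps(payload)


def poll_join(send, payload, interval=5.0, max_attempts=None, sleep=time.sleep):
    """Repeat the join request until controller accepts or rejects it.

    `send` is a hypothetical callable that performs the HTTP request and
    returns "accepted", "rejected" or "pending"; it raises OSError while
    controller is unreachable.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            status = send(payload)
        except OSError:
            status = "pending"  # controller not reachable yet, keep polling
        if status in ("accepted", "rejected"):
            return status
        sleep(interval)
    return "pending"
```

Passing @send@ and @sleep@ in as parameters keeps the loop testable without a network; a real service would presumably poll without an attempt limit.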

Controller has an API to list/accept/remove "join" requests, again à la Puppet. An arvados-server command could give CLI access to that API.

Controller can issue a PSK that it accepts for automatic joining (handy in tests, etc.). A service could be given that PSK on startup (e.g. via an environment variable).

Controller could have a "discovery" mode in which it automatically accepts all "join" requests (this could be very handy for automated testing).
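
The three controller behaviours described above (manual list/accept/remove, PSK-based automatic joining, and "discovery" mode) can be sketched as a small in-memory registry. The class name @JoinRegistry@, its method names, and the choice to key requests by FQDN are hypothetical; a real controller would need persistent storage and an HTTP API in front of this.

```python
import hmac
import secrets


class JoinRegistry:
    """In-memory sketch of controller's bookkeeping for join requests."""

    def __init__(self, auto_join_psk=None, discovery_mode=False):
        self.auto_join_psk = auto_join_psk
        self.discovery_mode = discovery_mode
        self.pending = {}   # fqdn -> service name
        self.accepted = {}  # fqdn -> service secret issued on acceptance

    def submit(self, fqdn, service_name, psk=None):
        """Record a join request and return its current status."""
        if fqdn in self.accepted:
            return "accepted"
        self.pending[fqdn] = service_name
        psk_ok = (
            self.auto_join_psk is not None
            and psk is not None
            and hmac.compare_digest(psk.encode(), self.auto_join_psk.encode())
        )
        if self.discovery_mode or psk_ok:
            self.accept(fqdn)
            return "accepted"
        return "pending"

    def list_pending(self):
        """Return a copy of the requests still waiting for a decision."""
        return dict(self.pending)

    def accept(self, fqdn):
        """Accept a pending request and issue a fresh service secret."""
        del self.pending[fqdn]
        self.accepted[fqdn] = secrets.token_hex(32)
        return self.accepted[fqdn]

    def remove(self, fqdn):
        """Drop a request, whether pending or already accepted."""
        self.pending.pop(fqdn, None)
        self.accepted.pop(fqdn, None)
```

For example, @JoinRegistry(discovery_mode=True)@ accepts every request immediately, which is the behaviour suggested for automated testing, while the default instance leaves requests pending until an administrator calls @accept@.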

When controller is instructed to "accept" a join request, it issues a service secret to the service, which writes it to a file on disk (in /etc/arvados/ ?). The service then requests the relevant config. That config should probably be cached on disk: we don't want every service to depend on controller being available at all times. The service could occasionally try to refresh the config when the cached copy is too old, but fall back to the cached copy if it can't reach controller.
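
The caching behaviour described above (refresh when stale, fall back to the cached copy when controller is unreachable) could be sketched as follows. The function name @load_config@, the JSON cache format, the @max_age@ default, and the @fetch@ callable are assumptions for illustration only.

```python
import json
import os
import tempfile
import time


def load_config(cache_path, fetch, max_age=300.0, now=time.time):
    """Return the config, refreshing the on-disk cache when it is stale.

    `fetch` is a hypothetical callable that retrieves the config from
    controller as a dict and raises OSError when controller is unreachable.
    """
    try:
        age = now() - os.path.getmtime(cache_path)
        with open(cache_path) as f:
            cached = json.load(f)
    except (OSError, ValueError):
        cached, age = None, None  # missing or unreadable cache

    if cached is not None and age <= max_age:
        return cached  # cache is fresh enough, no need to contact controller

    try:
        fresh = fetch()
    except OSError:
        if cached is not None:
            return cached  # controller unreachable: fall back to the stale copy
        raise  # no cache and no controller: nothing to fall back to

    # Write to a temporary file and rename it into place, so a crash never
    # leaves a half-written cache behind.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(fresh, f)
    os.replace(tmp_path, cache_path)
    return fresh
```

With this shape a service only fails to start when it has never successfully fetched its config, which matches the goal of not depending on controller being available at all times.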

h2. Puppet discovery process (for background reference):

Puppet does this by giving the server its own CA, which generates the server's cert with

```
X509v3 Subject Alternative Name:
                DNS:puppet, DNS:$fqdn, DNS:puppet.$domain
```

On first connect, the client generates a private key and a CSR, fetches the CA's public cert, and sends the CSR to the server. The client then polls until an administrator on the server approves the CSR. Once it is approved, the signed cert is sent back to the client, which uses it to communicate with the master.

For more information see https://www.masterzen.fr/2010/11/14/puppet-ssl-explained/

h2. Future work (not part of this design):

* controller automatically updates the config and adds/removes services as they register/deregister with it