Cluster configuration » History » Version 20

Tom Clegg, 01/23/2019 05:57 PM

h1. Cluster configuration

We are (2019) consolidating configuration from per-microservice yaml/json/ini files into a single cluster configuration document that is used by all components.
* Long term: system nodes automatically keep their configs synchronized (using something like Consul).
* Short term: the sysadmin uses tools like Puppet and Terraform to ensure @/etc/arvados/config.yml@ is identical on all system nodes.
* Hosts without config files (e.g., hosts outside the cluster) can retrieve the config document from the API server.

h2. Discovery document

Previously, we copied selected config values from the API server config into the API discovery document so clients could see them. Once clients can get the configuration document itself, this will no longer be needed. The discovery document should advertise the APIs provided by the server, not cluster configuration.

h2. Secrets

Secrets like BlobSigningKey can be given literally in the config file (convenient for dev/test, consul-template, etc.) or indirectly using a secret backend. Anticipated backends:
* <code class="yaml">BlobSigningKey: foobar</code> &rArr; the secret is literally <code>foobar</code>
* <code class="yaml">BlobSigningKey: "vault:foobar"</code> &rArr; the secret can be obtained from vault using the vault key "foobar"
* <code class="yaml">BlobSigningKey: "file:/foobar"</code> &rArr; the secret can be read from the local file @/foobar@
* <code class="yaml">BlobSigningKey: "env:FOOBAR"</code> &rArr; the secret can be read from the environment variable @FOOBAR@
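
The prefix scheme above could be resolved along the following lines. This is only a sketch: the function name @resolveSecret@ is hypothetical, the vault backend is left as a stub because no client API is specified here, and trimming a trailing newline from file contents is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveSecret interprets a config value using the prefixes listed above.
// The name and error messages are illustrative, not part of Arvados.
func resolveSecret(value string) (string, error) {
	switch {
	case strings.HasPrefix(value, "file:"):
		buf, err := os.ReadFile(strings.TrimPrefix(value, "file:"))
		if err != nil {
			return "", err
		}
		// Assumption: a trailing newline in the file is not part of the secret.
		return strings.TrimRight(string(buf), "\n"), nil
	case strings.HasPrefix(value, "env:"):
		name := strings.TrimPrefix(value, "env:")
		v, ok := os.LookupEnv(name)
		if !ok {
			return "", fmt.Errorf("environment variable %q is not set", name)
		}
		return v, nil
	case strings.HasPrefix(value, "vault:"):
		// Stub: a real implementation would query the vault backend here.
		key := strings.TrimPrefix(value, "vault:")
		return "", fmt.Errorf("vault backend not implemented in this sketch (key %q)", key)
	default:
		// No recognized prefix: the value is the secret itself.
		return value, nil
	}
}

func main() {
	os.Setenv("FOOBAR", "from-env")
	for _, v := range []string{"foobar", "env:FOOBAR", "vault:foobar"} {
		secret, err := resolveSecret(v)
		fmt.Printf("%q -> %q, err=%v\n", v, secret, err)
	}
}
```

One consequence of this scheme is that a literal secret which happens to start with @file:@, @env:@ or @vault:@ cannot be written directly; how to escape such values is not specified here.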

h2. Implementation

Development strategy for switching config file format/location in an operator-friendly way:
# Read the new config file into an internal struct, if the new config file exists.
# Copy old config file values into the new config struct.
# Use the new config struct internally (the old config is no longer referenced except in the load-and-copy-to-new-struct step).
# Add a mechanism for dumping the new config struct at startup/runtime after loading both new and old configs.
# Add a mechanism for reporting that some parts of the old config are not redundant, i.e., haven't been migrated to the new config file by the operator. [optional?]
# Wait one minor version release cycle.
# Error out if the new config file does not exist.
# Error out if the old config file exists (...and some parts of the old config are not redundant [optional?]).
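
The steps above might look roughly like this in Go. Everything here is a hypothetical sketch: the @Config@ and @LegacyConfig@ types, the two fields shown, and the function names are made up for illustration, and the strict check implements the optional variant of the last step (it only fails when the old config still carries values that differ from the new one).

```go
package main

import (
	"errors"
	"fmt"
)

// Config stands in for the new cluster config struct (two fields only).
type Config struct {
	BlobSigningKey   string
	BlobSignatureTTL int
}

// LegacyConfig stands in for one old per-service config file.
// A nil pointer means the old file does not set that value.
type LegacyConfig struct {
	BlobSigningKey   *string
	BlobSignatureTTL *int
}

// mergeLegacy starts from the new config (nil if the file does not exist),
// copies old values over it, and returns the names of old values that were
// not redundant, i.e. differed from what the new config already said.
func mergeLegacy(newCfg *Config, old *LegacyConfig) (Config, []string) {
	var cfg Config
	if newCfg != nil {
		cfg = *newCfg
	}
	var notMigrated []string
	if old == nil {
		return cfg, notMigrated
	}
	if old.BlobSigningKey != nil {
		if cfg.BlobSigningKey != *old.BlobSigningKey {
			notMigrated = append(notMigrated, "BlobSigningKey")
		}
		cfg.BlobSigningKey = *old.BlobSigningKey
	}
	if old.BlobSignatureTTL != nil {
		if cfg.BlobSignatureTTL != *old.BlobSignatureTTL {
			notMigrated = append(notMigrated, "BlobSignatureTTL")
		}
		cfg.BlobSignatureTTL = *old.BlobSignatureTTL
	}
	return cfg, notMigrated
}

// checkStrict is the behavior after the release cycle: the new config
// must exist, and an old config must not carry unmigrated values.
func checkStrict(newCfg *Config, old *LegacyConfig, notMigrated []string) error {
	if newCfg == nil {
		return errors.New("new config file does not exist")
	}
	if old != nil && len(notMigrated) > 0 {
		return fmt.Errorf("old config file has unmigrated values: %v", notMigrated)
	}
	return nil
}

func main() {
	key := "ungu355able"
	newCfg := &Config{BlobSignatureTTL: 172800}
	old := &LegacyConfig{BlobSigningKey: &key}
	cfg, notMigrated := mergeLegacy(newCfg, old)
	fmt.Printf("effective config: %+v\n", cfg) // the "dump" mechanism
	fmt.Printf("not yet migrated: %v\n", notMigrated)
	fmt.Println("strict check:", checkStrict(newCfg, old, notMigrated))
}
```

During the transition release only @mergeLegacy@ would run, with the unmigrated list logged as a warning; @checkStrict@ would be switched on one minor version later.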

h2. Example config file

(Format not yet frozen!)

Notes:
* Keys are CamelCase, except in special cases like PostgreSQL connection settings, which are passed through to another system without being interpreted by Arvados.
* Arrays and lists are not permitted. These cannot be expressed natively in consul, and tend to be troublesome anyway: "what changed?" is harder to answer usefully, significance of duplicate elements is unclear, etc. Sets are written as maps instead, as in the @DockerImageFormats@ entry of the example.

<pre><code class="yaml">
Clusters:
  xyzzy:
    ManagementToken: eec1999ccb6d75840a2c09bc70b6d3cbc990744e
    BlobSigningKey: ungu355able
    BlobSignatureTTL: 172800
    SessionKey: 186005aa54cab1ca95a3738e6e954e0a35a96d3d13a8ea541f4156e8d067b4f3
    PostgreSQL:
      ConnectionPool: 32 # max concurrent connections per arvados server daemon
      Connection:
        # All parameters here are passed to the PG client library in a connection string;
        # see https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
        Host: localhost
        Port: 5432
        User: arvados
        Password: s3cr3t
        DBName: arvados_production
        client_encoding: utf8
        fallback_application_name: arvados
    HTTPRequestTimeout: 5m
    Defaults:
      CollectionReplication: 2
      TrashLifetime: 2w
    UserActivation:
      ActivateNewUsers: true
      AutoAdminUser: root@example.com
      UserProfileNotificationAddress: notify@example.com
      NewUserNotificationRecipients: {}
      NewInactiveUserNotificationRecipients: {}
    RequestLimits:
      MaxRequestLogParamsSize: 2KB
      MaxRequestSize: 128MiB
      MaxIndexDatabaseRead: 128MiB
      MaxItemsPerResponse: 1000
      MultiClusterRequestConcurrency: 4
    LogLevel: info
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SSHPort: 22
      SyncInterval: 1m    # how often to get list of active instances from cloud provider
      TimeoutIdle: 1m     # shutdown if idle longer than this
      TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shutdown if (after booting) communication fails longer than this, even if ctrs are running
      TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:
        Region: us-east-1
        APITimeout: 20s
        AWSAccessKeyID: abcdef
        AWSSecretAccessKey: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0a01b48b88d14541e
        SubnetID: subnet-24f5ae62
        SecurityGroups: sg-3ec53e2a
    AuditLogs:
      MaxAge: 2w
      DeleteBatchSize: 100000
      UnloggedAttributes: {} # example: {"manifest_text": true}
    ContainerLogStream:
      BatchSize: 4KiB
      BatchTime: 1s
      ThrottlePeriod: 1m
      ThrottleThresholdSize: 64KiB
      ThrottleThresholdLines: 1024
      TruncateSize: 64MiB
      PartialLineThrottlePeriod: 5s
    Timers:
      TrashSweepInterval: 60s
      ContainerDispatchPollInterval: 10s
      APIRequestTimeout: 20s
    Scaling:
      MaxComputeNodes: 64
      EnablePreemptibleInstances: false
    DisableAPIMethods: {} # example: {"jobs.create": true}
    DockerImageFormats: {"v2": true}
    Crunch1:
      Enable: true
      CrunchJobWrapper: none
      CrunchJobUser: crunch
      CrunchRefreshTrigger: /tmp/crunch_refresh_trigger
      DefaultDockerImage: false
    NodeProfiles:
      # Key is a profile name; can be specified on service prog command line, defaults to $(hostname)
      keep:
        # Don’t run other services automatically -- only specified ones
        Default: {Disable: true}
        Keepstore: {Listen: ":25107"}
      apiserver:
        Default: {Disable: true}
        RailsAPI: {Listen: ":9000", TLS: true}
        Controller: {Listen: ":9100"}
        Websocket: {Listen: ":9101"}
        Health: {Listen: ":9199"}
      keepproxy: # profile name assumed here; a YAML mapping cannot contain the key "keep" twice
        Default: {Disable: true}
        KeepProxy: {Listen: ":9102"}
        KeepWeb: {Listen: ":9103"}
      "*": # quoted because a bare * starts a YAML alias
        # This section used for a node whose profile name is not listed above
        Default: {Disable: false} # (this is the default behavior)
    Volumes:
      xyzzy-keep-0:
        Type: s3
        Region: us-east
        Bucket: xyzzy-keep-0
        # [rest of keepstore volume config goes here]
    WebRoutes:
      # “default” means route according to method/host/path (e.g., if host is a login shell, route there)
      xyzzy.arvadosapi.com: default
      # “collections” means always route to keep-web
      collections.xyzzy.arvadosapi.com: collections
      # leading * is a wildcard (longest match wins)
      "*--collections.xyzzy.arvadosapi.com": collections
      cloud.curoverse.com: workbench
      workbench.xyzzy.arvadosapi.com: workbench
      "*.xyzzy.arvadosapi.com": default
    InstanceTypes:
      m4.large:
        VCPUs: 2
        RAM: 8000000000
        Scratch: 31000000000
        Price: 0.1
      m4.large-1t:
        # same instance type as m4.large but our scripts attach more scratch
        ProviderType: m4.large
        VCPUs: 2
        RAM: 8000000000
        Scratch: 999000000000
        Price: 0.12
      m4.xlarge:
        VCPUs: 4
        RAM: 16000000000
        Scratch: 78000000000
        Price: 0.2
      m4.8xlarge:
        VCPUs: 40
        RAM: 160000000000
        Scratch: 156000000000
        Price: 2
      m4.16xlarge:
        VCPUs: 64
        RAM: 256000000000
        Scratch: 310000000000
        Price: 3.2
      c4.large:
        VCPUs: 2
        RAM: 3750000000
        Price: 0.1
      c4.8xlarge:
        VCPUs: 36
        RAM: 60000000000
        Price: 1.591
    RemoteClusters:
      xrrrr:
        Host: xrrrr.arvadosapi.com
        Proxy: true        # proxy requests to xrrrr on behalf of our clients
        AuthProvider: true # users authenticated by xrrrr can use our cluster
</code></pre>
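
The @WebRoutes@ comment "leading * is a wildcard (longest match wins)" can be read in more than one way. The sketch below shows one plausible interpretation, not the actual routing code: an exact host name always wins, and otherwise the longest wildcard pattern whose suffix matches the host is used. The function name @lookupRoute@ is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupRoute returns the route target for host.
// Assumed semantics: exact entries take priority; among entries with a
// leading "*", the longest pattern whose remainder is a suffix of host wins.
func lookupRoute(routes map[string]string, host string) (string, bool) {
	if target, ok := routes[host]; ok {
		return target, true
	}
	best, bestLen, found := "", -1, false
	for pattern, target := range routes {
		if !strings.HasPrefix(pattern, "*") {
			continue
		}
		if strings.HasSuffix(host, pattern[1:]) && len(pattern) > bestLen {
			best, bestLen, found = target, len(pattern), true
		}
	}
	return best, found
}

func main() {
	// Same entries as the WebRoutes section of the example config.
	routes := map[string]string{
		"xyzzy.arvadosapi.com":                "default",
		"collections.xyzzy.arvadosapi.com":    "collections",
		"*--collections.xyzzy.arvadosapi.com": "collections",
		"cloud.curoverse.com":                 "workbench",
		"workbench.xyzzy.arvadosapi.com":      "workbench",
		"*.xyzzy.arvadosapi.com":              "default",
	}
	for _, host := range []string{
		"workbench.xyzzy.arvadosapi.com",
		"abc--collections.xyzzy.arvadosapi.com",
		"shell.xyzzy.arvadosapi.com",
		"example.org",
	} {
		target, ok := lookupRoute(routes, host)
		fmt.Printf("%s -> %q (matched=%v)\n", host, target, ok)
	}
}
```

Because routes are stored as a map rather than an ordered list, the result cannot depend on the order of entries in the config file, which is consistent with the "no arrays or lists" rule in the notes.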