Cluster configuration » History » Version 18

Tom Clegg, 01/23/2019 03:19 PM

h1. Cluster configuration

As of 2019, we are consolidating configuration from per-microservice YAML/JSON/INI files into a single cluster configuration document that is used by all components.

* Long term: system nodes automatically keep their configs synchronized (using something like Consul).
* Short term: the sysadmin uses tools like Puppet and Terraform to ensure @/etc/arvados/config.yml@ is identical on all system nodes.
* Hosts without config files (e.g., hosts outside the cluster) can retrieve the config document from the API server.
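
Until nodes synchronize their configs automatically, one simple way to check that @/etc/arvados/config.yml@ really is identical everywhere is to compare a digest of the file collected from each node. The following is only a sketch: the helper names @config_digest@ and @configs_match@ are hypothetical and not part of Arvados, and how the digests are gathered from the nodes is left open.

```python
import hashlib
from pathlib import Path


def config_digest(path="/etc/arvados/config.yml"):
    """Return the SHA-256 hex digest of the config file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def configs_match(digests):
    """Return True if every node reported the same digest.

    digests maps a node name to the hex digest computed on that node
    (hypothetical input; collecting it is outside this sketch).
    """
    return len(set(digests.values())) <= 1
```

Comparing raw bytes is deliberately strict: two files that differ only in whitespace or comments are reported as different, which matches the stated goal that the file be identical on all system nodes.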

h2. Discovery document

Previously, we copied selected config values from the API server config into the API discovery document so clients could see them. Once clients can retrieve the configuration document itself, this will no longer be needed. The discovery document should advertise the APIs provided by the server, not the cluster configuration.

h2. Secrets

Secrets like @BlobSigningKey@ can be given literally in the config file (convenient for dev/test, consul-template, etc.) or indirectly, by referring to a secret backend. Anticipated forms:

* <code class="yaml">BlobSigningKey: foobar</code> &rArr; the secret is literally <code>foobar</code>
* <code class="yaml">BlobSigningKey: "vault:foobar"</code> &rArr; the secret can be obtained from Vault using the Vault key @foobar@
* <code class="yaml">BlobSigningKey: "file:/foobar"</code> &rArr; the secret can be read from the local file @/foobar@
* <code class="yaml">BlobSigningKey: "env:FOOBAR"</code> &rArr; the secret can be read from the environment variable @FOOBAR@
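
The prefix scheme above could be resolved along the following lines. This is a minimal sketch, not the Arvados implementation: the function name @resolve_secret@ is hypothetical, @vault_lookup@ is a placeholder callable standing in for whatever Vault client is eventually used, and stripping a trailing newline from file contents is an assumption.

```python
import os


def resolve_secret(value, vault_lookup=None, environ=os.environ):
    """Resolve a config value to the secret it denotes.

    vault_lookup is a hypothetical callable (key -> secret); no real
    Vault client API is assumed here.
    """
    scheme, sep, rest = value.partition(":")
    if sep and scheme == "vault":
        if vault_lookup is None:
            raise ValueError("no vault backend configured")
        return vault_lookup(rest)
    if sep and scheme == "file":
        with open(rest, encoding="utf-8") as f:
            # Assumption: a single trailing newline is not part of the secret.
            return f.read().rstrip("\n")
    if sep and scheme == "env":
        return environ[rest]
    # No recognized prefix: the value is the secret itself.
    return value
```

One consequence of this design is that a literal secret cannot itself begin with @vault:@, @file:@, or @env:@; such a value would have to be supplied indirectly through one of the backends.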

h2. Example config file

(Format not yet frozen!)

<pre><code class="yaml">
Clusters:
  xyzzy:
    ManagementToken: eec1999ccb6d75840a2c09bc70b6d3cbc990744e
    BlobSigningKey: ungu355able
    BlobSignatureTTL: 172800
    SessionKey: 186005aa54cab1ca95a3738e6e954e0a35a96d3d13a8ea541f4156e8d067b4f3
    PostgreSQL:
      ConnectionPool: 32 # max concurrent connections per arvados server daemon
      Connection:
        # All parameters here are passed to the PG client library in a connection string;
        # see https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
        Host: localhost
        Port: 5432
        User: arvados
        Password: s3cr3t
        DBName: arvados_production
        client_encoding: utf8
        fallback_application_name: arvados
    HTTPRequestTimeout: 5m
    Defaults:
      CollectionReplication: 2
      TrashLifetime: 2w
    UserActivation:
      ActivateNewUsers: true
      AutoAdminUser: root@example.com
      UserProfileNotificationAddress: notify@example.com
      NewUserNotificationRecipients: {}
      NewInactiveUserNotificationRecipients: {}
    RequestLimits:
      MaxRequestLogParamsSize: 2KB
      MaxRequestSize: 128MiB
      MaxIndexDatabaseRead: 128MiB
      MaxItemsPerResponse: 1000
      MultiClusterRequestConcurrency: 4
    LogLevel: info
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SSHPort: 22
      SyncInterval: 1m    # how often to get the list of active instances from the cloud provider
      TimeoutIdle: 1m     # shut down if idle longer than this
      TimeoutBooting: 10m # shut down if the instance exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shut down if (after booting) communication fails longer than this, even if containers are running
      TimeoutShutdown: 1m # shut down again if the node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:
        Region: us-east-1
        APITimeout: 20s
        AWSAccessKeyID: abcdef
        AWSSecretAccessKey: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0a01b48b88d14541e
        SubnetID: subnet-24f5ae62
        SecurityGroups: sg-3ec53e2a
    AuditLogs:
      MaxAge: 2w
      DeleteBatchSize: 100000
      UnloggedAttributes: {} # example: {"manifest_text": true}
    ContainerLogStream:
      BatchSize: 4KiB
      BatchTime: 1s
      ThrottlePeriod: 1m
      ThrottleThresholdSize: 64KiB
      ThrottleThresholdLines: 1024
      TruncateSize: 64MiB
      PartialLineThrottlePeriod: 5s
    Timers:
      TrashSweepInterval: 60s
      ContainerDispatchPollInterval: 10s
      APIRequestTimeout: 20s
    Scaling:
      MaxComputeNodes: 64
      EnablePreemptibleInstances: false
    DisableAPIMethods: {} # example: {"jobs.create": true}
    DockerImageFormats: {"v2": true}
    Crunch1:
      Enable: true
      CrunchJobWrapper: none
      CrunchJobUser: crunch
      CrunchRefreshTrigger: /tmp/crunch_refresh_trigger
      DefaultDockerImage: false
    NodeProfiles:
      # Key is a profile name; can be specified on the service program's command line, defaults to $(hostname)
      keep:
        # Don’t run other services automatically -- only specified ones
        Default: {Disable: true}
        Keepstore: {Listen: ":25107"}
      apiserver:
        Default: {Disable: true}
        RailsAPI: {Listen: ":9000", TLS: true}
        Controller: {Listen: ":9100"}
        Websocket: {Listen: ":9101"}
        Health: {Listen: ":9199"}
      keepproxy:
        Default: {Disable: true}
        KeepProxy: {Listen: ":9102"}
        KeepWeb: {Listen: ":9103"}
      "*":
        # This section is used for a node whose profile name is not listed above
        Default: {Disable: false} # (this is the default behavior)
    Volumes:
      xyzzy-keep-0:
        Type: s3
        Region: us-east
        Bucket: xyzzy-keep-0
        # [rest of keepstore volume config goes here]
    WebRoutes:
      # “default” means route according to method/host/path (e.g., if host is a login shell, route there)
      xyzzy.arvadosapi.com: default
      # “collections” means always route to keep-web
      collections.xyzzy.arvadosapi.com: collections
      # leading * is a wildcard (longest match wins)
      "*--collections.xyzzy.arvadosapi.com": collections
      cloud.curoverse.com: workbench
      workbench.xyzzy.arvadosapi.com: workbench
      "*.xyzzy.arvadosapi.com": default
    InstanceTypes:
      m4.large:
        VCPUs: 2
        RAM: 8000000000
        Scratch: 31000000000
        Price: 0.1
      m4.large-1t:
        # same instance type as m4.large but our scripts attach more scratch
        ProviderType: m4.large
        VCPUs: 2
        RAM: 8000000000
        Scratch: 999000000000
        Price: 0.12
      m4.xlarge:
        VCPUs: 4
        RAM: 16000000000
        Scratch: 78000000000
        Price: 0.2
      m4.8xlarge:
        VCPUs: 40
        RAM: 160000000000
        Scratch: 156000000000
        Price: 2
      m4.16xlarge:
        VCPUs: 64
        RAM: 256000000000
        Scratch: 310000000000
        Price: 3.2
      c4.large:
        VCPUs: 2
        RAM: 3750000000
        Price: 0.1
      c4.8xlarge:
        VCPUs: 36
        RAM: 60000000000
        Price: 1.591
    RemoteClusters:
      xrrrr:
        Host: xrrrr.arvadosapi.com
        Proxy: true        # proxy requests to xrrrr on behalf of our clients
        AuthProvider: true # users authenticated by xrrrr can use our cluster
</code></pre>
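
The @WebRoutes@ comments in the example say that a leading @*@ is a wildcard and that the longest match wins. One way to read that rule is sketched below. The function name @match_route@ is hypothetical, and two details are assumptions rather than something the example specifies: an exact host entry is preferred over any wildcard, and a wildcard key matches any host that ends with the text after the @*@.

```python
def match_route(host, routes):
    """Return the route target for host, or None if nothing matches.

    Assumptions: an exact key beats any wildcard; a key starting with "*"
    matches hosts ending with the rest of the key; among matching wildcard
    keys, the longest key wins.
    """
    if host in routes:
        return routes[host]
    best_key, best_target = None, None
    for key, target in routes.items():
        if key.startswith("*") and host.endswith(key[1:]):
            if best_key is None or len(key) > len(best_key):
                best_key, best_target = key, target
    return best_target


# The WebRoutes table from the example config, as a plain dict.
routes = {
    "xyzzy.arvadosapi.com": "default",
    "collections.xyzzy.arvadosapi.com": "collections",
    "*--collections.xyzzy.arvadosapi.com": "collections",
    "cloud.curoverse.com": "workbench",
    "workbench.xyzzy.arvadosapi.com": "workbench",
    "*.xyzzy.arvadosapi.com": "default",
}
```

With this table, a host such as @abc--collections.xyzzy.arvadosapi.com@ matches both wildcard entries, and the longer @*--collections.xyzzy.arvadosapi.com@ key decides the result.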