Feature #21383 (Closed): Update Salt installer to support Debian 12
Description
See Salt issue #61064. According to their changelog, that fix was released in Salt 3005.
Similarly, Python 3.11+ needs Salt >= 3006 because of Salt issue #62676.
Updated by Brett Smith about 1 year ago
- Related to Idea #20846: Support Ubuntu 22.04 LTS added
Updated by Brett Smith about 1 year ago
- Description updated (diff)
- Subject changed from Salt installer needs Salt>=3005 to work with Python 3.10+ to Salt installer needs more modern Salt to work with more modern Python
Updated by Brett Smith about 1 year ago
- Subject changed from Salt installer needs more modern Salt to work with more modern Python to Update Salt installer to support Ubuntu 22.04/Debian 12
Updated by Brett Smith about 1 year ago
- Subject changed from Update Salt installer to support Ubuntu 22.04/Debian 12 to Update Salt installer to support Debian 12
21383-salt-debian12 @ e89e1e8c0f1907525559c8152f3286a70fe623cb
plus https://github.com/brettcs/postgres-formula/commit/cb19c884819a2d72f5462fdc41b7e527a715d489
The original reason we forked postgres-formula is gone: the conflicting key issue is now resolved in the main branch upstream. But it needs support for debian12 added. My branch adds that to the current upstream main branch. If this looks good in review, I assume we'll want to get my branch over to the arvados fork, and then point our installer at it. That's all fine.
Still works on debian11: test-provision-debian11: #577
and gets debian12 green: test-provision-debian12: #7
Can't test ubuntu2004 because of #21384, a preexisting issue. Can't test ubuntu2204 because we don't have a cloud image for it yet. But these changes were very straightforward so I don't see any reason to expect trouble there.
- All agreed upon points are implemented / addressed.
  - Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - Yes, see above
- Code is tested and passing, both automated and manual, what manual testing was done is described.
  - See above
- Documentation has been updated.
  - Not yet; expected to be done as part of #20846 after we've got ubuntu2204 working too.
- Behaves appropriately at the intended scale (describe intended scale).
  - No change
- Considered backwards and forwards compatibility issues between client and server.
  - Confirmed it doesn't break debian11
- Follows our coding standards and GUI style guidelines.
  - N/A, no standards for Salt or shell
Updated by Brett Smith about 1 year ago
After this is merged, test-provision-debian12 should be added to the test-provision multijob.
Updated by Brett Smith about 1 year ago
- Blocks Feature #21388: Update list of supported distributions everywhere added
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Updated by Lucas Di Pentima about 1 year ago
Just in case, I tried a multi-node installation on both Debian 11 & 12. The former worked fine; for the latter, I used the ami-0e365edd3d30d031b AMI (from https://wiki.debian.org/Cloud/AmazonEC2Image/Bookworm) and it's giving me a package repo security error, as commented on chat:
root@controller:/home/admin# apt-get update
Get:1 file:/etc/apt/mirrors/debian.list Mirrorlist [38 B]
Get:5 file:/etc/apt/mirrors/debian-security.list Mirrorlist [47 B]
Hit:2 https://cdn-aws.deb.debian.org/debian bookworm InRelease
Hit:3 https://cdn-aws.deb.debian.org/debian bookworm-updates InRelease
Hit:4 https://cdn-aws.deb.debian.org/debian bookworm-backports InRelease
Hit:6 https://cdn-aws.deb.debian.org/debian-security bookworm-security InRelease
Get:7 http://apt.arvados.org/bookworm bookworm InRelease [4,111 B]
Hit:8 https://repo.saltproject.io/salt/py3/debian/11/amd64/3006 bullseye InRelease
Reading package lists... Done
E: The repository 'http://apt.arvados.org/bookworm bookworm InRelease' provides only weak security information.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
Updated by Brett Smith about 1 year ago
Lucas Di Pentima wrote in #note-9:
Just in case, I tried a multi-node installation on both Debian 11 & 12. The former worked fine; for the latter, I used the ami-0e365edd3d30d031b AMI (from https://wiki.debian.org/Cloud/AmazonEC2Image/Bookworm) and it's giving me a package repo security error, as commented on chat:
I think you might be getting this because you're using the bookworm repository, as opposed to bookworm-dev. No releases have been published to that repository yet, so http://apt.arvados.org/bookworm/dists/bookworm/main/binary-amd64/Packages is empty and contains no checksums, which I think is what triggers the error you're seeing. Contrast http://apt.arvados.org/bookworm/dists/bookworm-dev/main/binary-amd64/Packages.
I think this is a good situation, actually, since we haven't published any official releases with Debian 12 support. Is it possible to do your test against bookworm-dev instead? Does that make sense?
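For reference, a quick way to see the difference between the two suites is to fetch their package indexes directly. This is just an illustrative check using the URLs above, not part of the installer:

# Illustrative check: the release suite's index is empty until packages are
# published there, while the -dev suite already lists packages.
curl -s http://apt.arvados.org/bookworm/dists/bookworm/main/binary-amd64/Packages | wc -c
curl -s http://apt.arvados.org/bookworm/dists/bookworm-dev/main/binary-amd64/Packages | grep -c '^Package:'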
For what it's worth I rebased the branch on current main, and reran test-deploy-debian12 with the packages that were published earlier, and it passed: test-provision-debian12: #8
Updated by Lucas Di Pentima about 1 year ago
Thanks. Now I think I've found a real issue, at least with the way the Debian 12 AMI works: it seems that the ami-0e365edd3d30d031b AMI doesn't have the cron service installed by default, and the logrotate-formula fails because of this.
If you want, I can keep poking at it to make it work on our AWS sandbox account, or we can get you access to it if you want to do it yourself.
Updated by Brett Smith about 1 year ago
Lucas Di Pentima wrote in #note-11:
If you want, I can keep poking at it to make it work on our AWS sandbox account, or we can get you access to it if you want to do it yourself.
I would like a sandbox account please. If I'm gonna start working on the Salt installer in earnest it would be good for me to be able to do my own testing.
Updated by Peter Amstutz 12 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Updated by Brett Smith 12 months ago
- Blocks Task #21518: Rig up a Jenkins test job added
Updated by Peter Amstutz 12 months ago
- Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz 11 months ago
- Target version set to Development 2024-03-13 sprint
- Tracker changed from Idea to Feature
Updated by Brett Smith 11 months ago
The story so far:
- The Debian 12 AMI does not install cron or any compatible tool by default. All it gives you out of the box is systemd timers.
- The Salt cron module does not even load unless the crontab command is available.
- Several of our formulas call the cron module, like letsencrypt. Here's a rough survey:
admin@controller:/srv/formulas$ find -name \*.sls -print0 | xargs -0 grep -l '\bcron\.'
./prometheus/prometheus/exporters/node_exporter/textfile_collectors/smartmon/install.sls
./prometheus/prometheus/exporters/node_exporter/textfile_collectors/smartmon/clean.sls
./prometheus/prometheus/exporters/node_exporter/textfile_collectors/ipmitool/install.sls
./prometheus/prometheus/exporters/node_exporter/textfile_collectors/ipmitool/clean.sls
./extra/extra/shell_cron_add_login_sync.sls
./arvados/test/salt/states/examples/arvados/shell/cron/add-login-sync.sls
./letsencrypt/letsencrypt/domains.sls
./logrotate/logrotate/config.sls
Some of these are probably false positives because they're behind conditionals or whatever. For example, all you need to do to make logrotate happy is one configuration setting (2557919f8b6a82bef3f8d4f246996440841ceb10):
logrotate:
  service: logrotate.timer
But we would at least need to:
- Fork/PR the letsencrypt formula
- Write our own systemd timer for arvados-login-sync
And even then I haven't dug in enough to know if the prometheus module would need configuration or updating or what. This would be the nice, clean, modern way to handle things.
Alternatively, we could just update our Salt installer to install a cron-compatible tool on all systems. (Please please please systemd-cron, not the classic cron.) This would be the cheap solution that keeps things running.
I'm too deep on this, need an external opinion.
Updated by Brett Smith 11 months ago
Brett Smith wrote in #note-17:
Alternatively, we could just update our Salt installer to install a cron-compatible tool on all systems. (Please please please systemd-cron, not the classic cron.) This would be the cheap solution that keeps things running.
Discussed with Lucas, we will install systemd-cron.
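For the record, the fix boils down to making sure a crontab provider exists before any state that uses Salt's cron module runs. A minimal sketch of that step on a Debian-family host (illustrative commands, not the actual installer change):

# Illustrative sketch: install systemd-cron so the crontab command exists
# and Salt's cron module can load; systemd timers back the cron jobs.
apt-get update
apt-get install --yes --no-install-recommends systemd-cron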
Updated by Brett Smith 11 months ago
- Blocked by Bug #21583: Running RailsAPI with Passenger implicitly requires Ruby 3.3 via base64 0.2.0 lock added
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Updated by Brett Smith 10 months ago
I have 21383-salt-debian12 rebased and in a state where the install succeeds and a lot of the cluster is functional, except compute containers fail like this:
Apr 07 19:31:01 crunch-run[1267]: {"ClusterID":"z2a07","PID":1273,"RequestID":"req-hqftyd3mv560b6cv1503","level":"info","msg":"request","remoteAddr":"10.1.1.4:44390","reqBytes":0,"reqForwardedFor":"","reqHost":"10.1.1.4:39751","reqMethod":"GET","reqPath":"96e625dfed222073491b5fee2fbf5eb9+24064+Ae635f42a3378f85d124c748936cc0ae5b51831b8@66130226","reqQuery":"","time":"2024-04-07T19:31:01.256302800Z"}
Apr 07 19:32:33 crunch-run[1267]: {"ClusterID":"z2a07","PID":1273,"RequestID":"req-hqftyd3mv560b6cv1503","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"10.1.1.4:44390","reqBytes":0,"reqForwardedFor":"","reqHost":"10.1.1.4:39751","reqMethod":"GET","reqPath":"96e625dfed222073491b5fee2fbf5eb9+24064+Ae635f42a3378f85d124c748936cc0ae5b51831b8@66130226","reqQuery":"","respBody":"exceeded maximum number of attempts, 3, request send failed, Get \"https://z2a07-nyw5e-000000000000000-volume.s3.us-east-1.amazonaws.com/96e/96e625dfed222073491b5fee2fbf5eb9\": dial tcp 52.216.178.118:443: i/o timeout\n","respBytes":216,"respStatus":"Internal Server Error","respStatusCode":500,"time":"2024-04-07T19:32:33.291263581Z","timeToStatus":92.034926,"timeTotal":92.034952,"timeWriteBody":0.000026}
This feels like it's probably not a Salt issue but something in the cloud configuration where compute node keepstore needs to be able to contact S3 but doesn't have permission. Maybe a Terraform problem, or I overlooked a step in the install process I was supposed to do manually. Keep generally works, I can arv-put and arv-get things, so it's not that.
Updated by Brett Smith 10 months ago
Confirmed:
- The IAM role z2a07-compute-node-00-iam-role exists
- It includes the S3 access policy it should
- The cluster's Containers.CloudVMs.DriverParameters.IAMInstanceProfile is set to that
- The compute node security group allows all outbound traffic
- Compute nodes are booted in the same subnet as the rest of the cluster
Running low on ideas.
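Possible next spot checks from a compute node, in case someone retraces this later (a sketch, assuming the AWS CLI is installed there; the bucket name comes from the log above):

# Debugging sketch: confirm the instance role yields credentials and that
# the Keep volume's S3 endpoint is reachable from the compute node at all.
aws sts get-caller-identity
aws s3 ls s3://z2a07-nyw5e-000000000000000-volume/
curl -v --max-time 10 -o /dev/null https://z2a07-nyw5e-000000000000000-volume.s3.us-east-1.amazonaws.com/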
Updated by Brett Smith 10 months ago
- Blocks deleted (Task #21518: Rig up a Jenkins test job)
Updated by Brett Smith 10 months ago
- Blocks Support #21664: Add test provision ubuntu 22.04 & make sure it passes added
Updated by Brett Smith 10 months ago
- Blocks Support #21663: Add test provision debian 12 & make sure it passes added
Updated by Brett Smith 10 months ago
Shower thought: the compute node issue might be because of IMDSv2? The compute node image was built before that change, but the rest of the cluster was installed after. I'm not totally sure they're related but it's at least plausible enough to build a new compute node image.
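One way to test that theory directly from a node is to exercise the IMDSv2 token flow by hand (a sketch using the standard EC2 metadata endpoint, nothing Arvados-specific):

# Sketch: request an IMDSv2 session token, then use it to read the
# instance's IAM role credentials path from the metadata service.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/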
Updated by Brett Smith 10 months ago
Brett Smith wrote in #note-29:
Shower thought: the compute node issue might be because of IMDSv2? The compute node image was built before that change, but the rest of the cluster was installed after. I'm not totally sure they're related but it's at least plausible enough to build a new compute node image.
Built a new image, no obvious progress on the problem.
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Updated by Brett Smith 10 months ago
21383-salt-debian12 @ 693a0d0d1247af022040700c7a9524f35b2237ca
build-packages-ubuntu2004: #1678
test-provision-ubuntu2004: #832
This is a little unorthodox, but I would like this branch to be reviewed and merged with an eye towards fixing test-provision-ubuntu2004. It is rebased on current main and has three basic parts:
- Upgrades to the Salt installer and formulas to support newer distributions. This is what you reviewed before. All of the 21383 commits are the same as before except db6d1ebdedf07b714d9664313c085aa6bc621277, which can now use an upstream prerelease rather than a fork. This may not be strictly necessary for Ubuntu 20.04, but note:
  - In the run-up to the 2.7.2 release we started having problems running test-provision with older versions of Salt. It might have been a temporary infrastructure hiccup on their end, but upgrading helps avoid trouble by staying more modern.
  - If we're going to do these upgrades to support newer distributions, they need to continue to work on the older distributions we support. The above run shows that they do.
  - Upgrading this stuff is a good thing to do even if we have no immediate functional need for it.
  - I do have a test Debian 12 cluster up and running based on this branch. It mostly works, except compute nodes can't contact S3 to access Keep blocks. That can be addressed in a separate bug fix afterwards, but this goes to show the changes are on point.
- Adds passenger_preload_bundler on to our nginx configuration, which is definitely necessary for Debian 12 and may be necessary for older distributions as well. It may depend on the version of Passenger more than the distro. It is something we want to be doing anyway. See #21583#note-17
- Adds a workaround for an apparent Bundler bug to our Rails postinst script. This is definitely necessary to get test-provision-ubuntu2004 going. See #21524#note-17
So, if merged, this branch would close #21583, #21524, and #21661. It would move the ball forward on this ticket #21383, but not resolve it until we figure out the S3 access issue.
Updated by Lucas Di Pentima 10 months ago
Brett Smith wrote in #note-29:
Shower thought: the compute node issue might be because of IMDSv2? The compute node image was built before that change, but the rest of the cluster was installed after. I'm not totally sure they're related but it's at least plausible enough to build a new compute node image.
I think you'll want to recreate the compute image no matter what, because the older ebs-autoscale won't work with IMDSv2.
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
Updated by Brett Smith 10 months ago
crunch@ip-10-1-1-243:~$ aws s3 ls s3://z2a07-nyw5e-000000000000000-volume/
Connect timeout on endpoint URL: "https://z2a07-nyw5e-000000000000000-volume.s3.us-east-1.amazonaws.com/?list-type=2&prefix=&delimiter=%2F&encoding-type=url"
Updated by Brett Smith 10 months ago
Compute node can't contact anything outside the subnet, which is weird because it seems to have an internet gateway attached.
Updated by Brett Smith 10 months ago
Brett Smith wrote in #note-38:
Compute node can't contact anything outside the subnet, which is weird because it seems to have an internet gateway attached.
This was PEBKAC. I accidentally set compute_subnet in local.params to be Terraform's arvados_subnet_id when it should've been compute_subnet_id. This meant the compute nodes were trying to send traffic out through the Internet gateway, which doesn't work, because you need a public IPv4 address to do that and the compute nodes don't get one; they need to use NAT instead.
I've worked on this ticket so on-and-off that I don't know if I just didn't realize there were two subnets, or had a brain fart when I copied+pasted, or what. Our install docs do at least say:

You'll also need compute_subnet_id and arvados_sg_id to set COMPUTE_SUBNET and COMPUTE_SG in local.params…

So it is technically covered. But, in between that sentence and me editing local.params, I went on the whole side quest of building a compute image. By the time I got that done, this connection wasn't front of mind, and nothing under the "Parameters from local.params" section reminds you of this, I think because that section doesn't want to assume you're installing in the cloud.
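To make the mapping concrete, the relevant lines in local.params should end up looking roughly like this (placeholder values; the variable names are the ones from the docs quoted above):

# Placeholders only -- take the real IDs from the Terraform output.
COMPUTE_SUBNET="subnet-xxxxxxxxxxxxxxxxx"   # from compute_subnet_id, NOT arvados_subnet_id
COMPUTE_SG="sg-xxxxxxxxxxxxxxxxx"           # from arvados_sg_id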
Like I mentioned at standup, I think what would've helped most is not needing to do this, and having "the installer" as a whole make the connections itself. I understand why we haven't done that in the current system: Terraform and Salt are two completely different systems and documenting "copy and paste these twelve values" seems way easier than building a good system to pass information between them. But, the current situation is error-prone. A tool like Ansible could take on both responsibilities in one stack and eliminate this problem.
Anyway, with just that subnet change, everything works and I think we can call this done finally:
admin@shell:~$ arvados-client diagnostics -internal-client
INFO 5: running health check (same as `arvados-server check`)
INFO ... skipping because config could not be loaded: open /etc/arvados/config.yml: no such file or directory
INFO 10: getting discovery document from https://z2a07.arvadosapi.com/discovery/v1/apis/arvados/v1/rest
INFO 20: getting exported config from https://z2a07.arvadosapi.com/arvados/v1/config
INFO 30: getting current user record
INFO 40: connecting to service endpoint https://keep.z2a07.arvadosapi.com:443/
INFO 41: connecting to service endpoint https://*.collections.z2a07.arvadosapi.com:443/
INFO 42: connecting to service endpoint https://download.z2a07.arvadosapi.com:443/
INFO 43: connecting to service endpoint wss://ws.z2a07.arvadosapi.com/websocket
INFO 44: connecting to service endpoint https://workbench.z2a07.arvadosapi.com:443/
INFO 45: connecting to service endpoint https://workbench2.z2a07.arvadosapi.com:443/
INFO 50: checking CORS headers at https://z2a07.arvadosapi.com:443/
INFO 51: checking CORS headers at https://keep.z2a07.arvadosapi.com:443/d41d8cd98f00b204e9800998ecf8427e+0
INFO 52: checking CORS headers at https://download.z2a07.arvadosapi.com:443/
INFO 60: checking internal/external client detection
INFO ... controller returned only non-proxy services, this host is treated as "internal"
INFO 61: reading+writing via keep service at http://10.1.2.13:25107/
INFO 80: finding/creating "scratch area for diagnostics" project
INFO 90: creating temporary collection
[+] Building 0.4s (8/8) FINISHED    docker:default
 => [internal] load build definition from Dockerfile    0.0s
 => => transferring dockerfile: 241B    0.0s
 => [internal] load metadata for docker.io/library/debian:stable-slim    0.1s
 => [internal] load .dockerignore    0.0s
 => => transferring context: 2B    0.0s
 => [1/3] FROM docker.io/library/debian:stable-slim@sha256:ff394977014e94e9a7c67bb22f5014ea069d156b86e001174f4bae6f4618297a    0.0s
 => [internal] load build context    0.2s
 => => transferring context: 17.97MB    0.1s
 => CACHED [2/3] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends libfuse2 ca-ce    0.0s
 => CACHED [3/3] COPY /arvados-client /arvados-client    0.0s
 => exporting to image    0.0s
 => => exporting layers    0.0s
 => => writing image sha256:cef4c288409db3753ff235448f01ab297bfff0bc96ad15e407af6069e03d28c4    0.0s
 => => naming to docker.io/library/arvados-client-diagnostics:1a1fae56b    0.0s
INFO ... arvados-client version: /arvados-client 2.8.0~dev20240426140337 (go1.20.6)
INFO ... docker image size is 125100032
INFO 100: uploading file via webdav
INFO 110: checking WebDAV ExternalURL wildcard (https://*.collections.z2a07.arvadosapi.com:443/)
INFO 120: downloading from webdav (https://d41d8cd98f00b204e9800998ecf8427e-0.collections.z2a07.arvadosapi.com:443/foo)
INFO 121: downloading from webdav (https://d41d8cd98f00b204e9800998ecf8427e-0.collections.z2a07.arvadosapi.com:443/sha256:cef4c288409db3753ff235448f01ab297bfff0bc96ad15e407af6069e03d28c4.tar)
INFO 122: downloading from webdav (https://download.z2a07.arvadosapi.com:443/c=d41d8cd98f00b204e9800998ecf8427e+0/_/foo)
INFO 123: downloading from webdav (https://download.z2a07.arvadosapi.com:443/c=d41d8cd98f00b204e9800998ecf8427e+0/_/sha256:cef4c288409db3753ff235448f01ab297bfff0bc96ad15e407af6069e03d28c4.tar)
INFO 124: downloading from webdav (https://e8cfb618b6149ec6204c5be0a829c461-177.collections.z2a07.arvadosapi.com:443/sha256:cef4c288409db3753ff235448f01ab297bfff0bc96ad15e407af6069e03d28c4.tar)
INFO 125: downloading from webdav (https://download.z2a07.arvadosapi.com:443/c=z2a07-4zz18-1v1t5d2e8po4vql/_/sha256:cef4c288409db3753ff235448f01ab297bfff0bc96ad15e407af6069e03d28c4.tar)
INFO 130: getting list of virtual machines
INFO 150: connecting to webshell service
INFO 160: running a container
INFO ... container request uuid = z2a07-xvhdp-ftr6bsybc5vbgp6
INFO ... container request submitted, waiting up to 10m for container to run
INFO 9990: deleting temporary collection
INFO --- no errors ---
Updated by Lucas Di Pentima 10 months ago
I've deployed a debian12 cluster from scratch; it worked, but I had a couple of issues:
- For some reason, terraform was getting an error like "Error: creating EC2 Instance: InvalidParameterValue: A value of '' is not valid for HttpEndpoint. Specify either 'enabled' or 'disabled' and try again." -- This is weird because http_endpoint is supposed to be enabled by default (as per the docs). I've enabled it explicitly in ada39ab on branch 21383-misc-fixes and it worked.
- The second issue I haven't been able to fix yet: I'm not able to create a compute image based on Debian 12. The error is: Unable to locate packages python3-arvados-fuse & arvados-docker-cleaner. I suspect it has something to do with the apt-key changes, but OTOH the service nodes were deployed correctly, so not 100% sure about that.
Updated by Brett Smith 10 months ago
Lucas Di Pentima wrote in #note-40:
- For some reason, terraform was getting an error like "Error: creating EC2 Instance: InvalidParameterValue: A value of '' is not valid for HttpEndpoint. Specify either 'enabled' or 'disabled' and try again." -- This is weird because http_endpoint is supposed to be enabled by default (as per the docs). I've enabled it explicitly in ada39ab on branch 21383-misc-fixes and it worked.
This sounds fine to me.
- The second issue I haven't been able to fix yet: I'm not able to create a compute image based on Debian 12. The error is: Unable to locate packages python3-arvados-fuse & arvados-docker-cleaner. I suspect it has something to do with the apt-key changes, but OTOH the service nodes were deployed correctly, so not 100% sure about that.
Are you running the compute image build.sh with the --reposuffix -dev option? This is another area where you have to manually sync configuration across different parts of the process. It makes sense it wouldn't find these packages if it's checking the production repository (the default), since that's currently empty.
Updated by Brett Smith 10 months ago
I don't think there are any surprises here, but for posterity here was my full compute image build command, run from arvados/tools/compute-images. Variables were sourced from terraform.log.
./build.sh \
--json-file arvados-images-aws.json \
--arvados-cluster-id "$cluster_name" \
--aws-source-ami ami-0e365edd3d30d031b \
--aws-vpc-id "$vpc_id" \
--aws-subnet-id "$arvados_subnet_id" \
--public-key-file ~/.ssh/PUBKEY.pub \
--reposuffix -dev \
--ssh_user admin
Updated by Lucas Di Pentima 9 months ago
Brett Smith wrote in #note-41:
"Specify either 'enabled' or 'disabled' and try again." -- This is weird because http_endpoint is supposed to be enabled by default (as per the docs). I've enabled it explicitly in ada39ab on branch 21383-misc-fixes and it worked.
This sounds fine to me.
Ok, will merge. Thanks.
Are you running the compute image build.sh with the --reposuffix -dev option? This is another area where you have to manually sync configuration across different parts of the process. It makes sense it wouldn't find these packages if it's checking the production repository (the default), since that's currently empty.
Whoops, I forgot to use --reposuffix -dev. That clears up my last issue; we can mark this as resolved.
Updated by Brett Smith 9 months ago
- Status changed from In Progress to Resolved