Feature #22317
Compute image builder uses ansible to provision compute node (still invoked by packer)
Status: Closed
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2024-12-04 to Development 2024-11-20
Updated by Brett Smith 2 months ago
22317-compute-node-ansible @ 2638fb1852ec48f5b65364768ab9978dda818efa
This is far enough along that people should look at it and we should make sure we like it before I continue with updating documentation and other interfaces, even though it is not ready to be merged as-is.
I have used this branch to build and successfully test a new compute image for tordo. ami-077c29f012ca7df74, tordo-xvhdp-s60utv85ngx4as9. Note this image uses Docker, not Singularity. I also manually booted an instance of this AMI and SSH'ed into it to dig into the state of specific systemd services, individual configuration files, etc. Everything looks as expected.
Assuming you already have Packer installed locally, you can test this yourself by installing Ansible and the Packer provisioner for it. You can adjust the Ansible install path to taste:
python3 -m venv ~/ansible
~/ansible/bin/pip install 'ansible~=8.7'
packer plugins install github.com/hashicorp/ansible
Once this is done, you can test by activating the virtualenv with . ~/ansible/bin/activate, then running build.sh as you normally would.
The choice of using Ansible 8 was semi-arbitrary. Basically every release of Ansible supports three versions of Python, and this was the latest Ansible version that covered the older Python versions we care about. Our final version requirement may need some additional flexibility. But for now I wanted to target a specific version for simplicity of testing.
The way this works is that when Packer gets to the provision step, it runs ansible-playbook in a way that's configured to know about a single host, default, which is the cloud instance it created. Ansible runs through the tasks defined in build-compute-image.yml in order. When a role is included, it starts running the tasks defined in roles/NAME/tasks/main.yml. Tasks in a role can refer to files and templates by relative path, which are found under the role's files and templates subdirectories respectively.
Ansible uses Jinja templates demarcated with {{ }} for variable data. Default values for variables are set in roles/NAME/defaults/main.yml. Values that the user provides override them. An override file is generated for you by build.sh and passed on to Ansible.
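For illustration, here is a hedged sketch of what a role's defaults and tasks can look like; the variables and file contents below are hypothetical placeholders, not copied from the branch:

# roles/compute_docker/defaults/main.yml (hypothetical contents)
# Defaults; values the user supplies via the override file win over these.
docker_package_name: docker.io
docker_daemon_config: /etc/docker/daemon.json

# roles/compute_docker/tasks/main.yml (hypothetical contents)
- name: Install Docker
  ansible.builtin.apt:
    name: "{{ docker_package_name }}"
    state: present

- name: Install Docker daemon configuration
  # src is resolved relative to the role's files/ subdirectory
  ansible.builtin.copy:
    src: daemon.json
    dest: "{{ docker_daemon_config }}"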
The Git commit message has details about the things I did a little differently than base.sh, and why.
Personally, I would like to get away from build.sh. The interface is not especially friendly, and it's a lot of code to provide relatively marginal value IMO. I would rather this whole build step work more like Arvados itself: we give you a YAML file of Ansible settings you can edit, plus one JSON file per cloud of Packer settings you can edit, and then you just invoke Packer with your edited JSON. I am curious what you think about that. If it sounds good to you, I can make it part of the branch and update documentation accordingly.
Updated by Lucas Di Pentima 2 months ago
My comments:
- Code looks a lot cleaner and more organized than the previous thing.
- I agree that getting rid of build.sh will simplify things even further and we won't be losing a lot in the process.
- I realized that cloud-init is not requested to be installed anymore, is that on purpose?
- Are you planning on supporting Ubuntu as well? AFAIK the previous packer script was usable with Ubuntu-based AMIs.
- Would it be convenient/possible to declare & check the expected Ansible (or its plugin) version from packer?
Updated by Brett Smith 2 months ago
Okay, I'll keep working on it from here, but for background:
Lucas Di Pentima wrote in #note-6:
- I realized that cloud-init is not requested to be installed anymore, is that on purpose?
Yes. By my reading, the main reason we installed that was to hook in the encrypted partition script as a boot script. Now that I've written a systemd service definition for that, we don't have any specific need for cloud-init ourselves.
To be clear, I expect most if not all of the images that we build on top of to use it. But in that case it should already be installed, and we shouldn't need to repeat that. If you know something I missed, please let me know.
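For context, the systemd hookup looks roughly like the following sketch; the unit name, script path, and ordering are placeholders for illustration, not the actual files in the branch:

# Hypothetical tasks to install and enable a boot-time setup service.
- name: Install encrypted-partition setup unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/arvados-ephemeral-disk.service  # hypothetical name
    content: |
      # Hypothetical unit; the real unit name and script path live in the branch.
      [Unit]
      Description=Set up encrypted scratch partition
      Before=docker.service

      [Service]
      Type=oneshot
      RemainAfterExit=yes
      ExecStart=/usr/local/sbin/ensure-encrypted-partitions.sh

      [Install]
      WantedBy=multi-user.target

- name: Enable encrypted-partition setup unit
  ansible.builtin.systemd:
    name: arvados-ephemeral-disk.service
    enabled: true
    daemon_reload: true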
- Are you planning on supporting Ubuntu as well? AFAIK the previous packer script was usable with Ubuntu-based AMIs
I was not able to get it to work in my recent testing. See my comments on #22217#note-9. That said, the only thing I think we need is an apt repository definition, which is no big deal, I can add that.
- Would it be convenient/possible to declare & check the expected Ansible (or its plugin) version from packer?
The plugin documentation describes how you can write a wrapper script that activates a virtualenv and potentially does other work before starting ansible-playbook. That would probably be the place to implement checks like that.
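For example, a version check could be as simple as an assert at the top of the playbook; this is just a sketch of the shape, not something the branch implements:

# Hypothetical pre-flight check at the top of the playbook.
- name: Check controller requirements
  hosts: default
  gather_facts: false
  tasks:
    - name: Require a supported ansible-core version
      ansible.builtin.assert:
        that:
          - ansible_version.full is version('2.15', '>=')
        fail_msg: Please install ansible-core 2.15 or later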
I am open to doing stuff like this, but personally I would like to hold off on it a little bit. When I wrote this branch, I tried to keep in mind the long-term goal of writing a whole installer based on Ansible. For example, the arvados_apt role should be usable on all nodes, roles like compute_docker could be reused for the shell node, etc.
I would like to be a little further along in that process, and have some idea of how we expect users to configure and run the whole Ansible installer, before we start codifying decisions about how and where Ansible gets installed and configured. I would rather not codify decisions like that now for the compute node builder, only to throw them all out again when we start working on the installer more generally. Because this compute node builder is a very ops-focused thing where the user is expected to be a little more expert, I think it's okay to leave it a little sharp around the edges for the time being.
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2024-11-20 to Development 2024-12-04
Updated by Brett Smith 2 months ago
22317-compute-node-ansible @ 81dfb4a1503fd923411baf484c99b94400b889e0
This has a couple small bugfixes from the previous branch, then goes on to remove build.sh and update the documentation with the new process. It guides the user through writing compute node configuration (for Ansible) and cloud configuration (for Packer), then running the tool directly.
- All agreed upon points are implemented / addressed.
  - Does what the ticket title says.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described.
  - See notes about testing in #note-5. I also ran the playbook on Debian VMs with a couple different configurations to check common configurations: that it installs either Docker or Singularity directly, that it pins packages or doesn't, etc.
  - I can't test on Jenkins because we have a chicken-and-egg problem: Jenkins will need updates to use the new process, which I can't really set up before the branch is merged. But to give you a preview, I will attach a copy of what I expect the packer-build-compute-node shell to look like after this branch is merged. It's not a huge improvement, but I don't think it's worse than the current configuration either.
- Documentation has been updated.
  - Yes.
- Behaves appropriately at the intended scale (describe intended scale).
  - N/A?
- Considered backwards and forwards compatibility issues between client and server.
  - N/A
- Follows our coding standards and GUI style guidelines.
  - N/A (no style guidelines for any of these languages)
Updated by Brett Smith 2 months ago
- File deleted (packer-build-compute-image.sh)
Updated by Brett Smith 2 months ago
Meant to add: we could make the Jenkins situation nicer by writing static Packer and Ansible configuration files for each cluster. If we do that, then the shell code "just" has to pick the right configuration file for Packer; it doesn't need to have all that written in the code. I think that would be a useful improvement, but we need to talk about where all those files would live and how Jenkins would get at them, so that feels like a separate ticket if we're interested in it.
Updated by Lucas Di Pentima 2 months ago
I've read and followed the packer+ansible documentation and had some issues that I'm not entirely sure were my own fault.
I tested using a Debian 11 VM with Packer 1.11.2 installed from HashiCorp's sources, and ansible-core 2.15.13 (Python 3.9.2).
- Documentation instructs to use the command git clone --depth=1 --branch=main https://git.arvados.org/arvados.git but I got the following error: "fatal: dumb http transport does not support shallow capabilities". It worked when using git://... instead.
- When running packer build ... I got the following errors:
  - variable aws_ena_support: '' expected type 'string', got unconvertible type 'bool', value: 'true'
  - variable aws_associate_public_ip_address: '' expected type 'string', got unconvertible type 'bool', value: 'true'
  - How I fixed it: replaced true with "true" on the boolean vars (see the diff below).
- Had to manually install the following plugins (I believe if we migrate to HCL, a simple packer init would auto-download the plugins, but the migration might not be a trivial task):
  packer plugins install github.com/hashicorp/amazon
  packer plugins install github.com/hashicorp/ansible
- Had to remove the --scp-extra-args='-O' because debian 11's scp doesn't support it.
diff --git a/tools/compute-images/aws_config.example.json b/tools/compute-images/aws_config.example.json
index 8e868cf..5fccace 100644
--- a/tools/compute-images/aws_config.example.json
+++ b/tools/compute-images/aws_config.example.json
@@ -7,8 +7,8 @@
"aws_access_key": "",
"aws_secret_key": "",
"aws_source_ami": "ami-0a9d5908c7201e91d",
- "aws_associate_public_ip_address": true,
- "aws_ena_support": true,
+ "aws_associate_public_ip_address": "true",
+ "aws_ena_support": "true",
"ssh_user": "admin",
"subnet_id": "",
"vpc_id": ""
diff --git a/tools/compute-images/aws_template.json b/tools/compute-images/aws_template.json
index 5db1127..33e49df 100644
--- a/tools/compute-images/aws_template.json
+++ b/tools/compute-images/aws_template.json
@@ -8,8 +8,8 @@
"aws_region": "",
"aws_secret_key": "",
"aws_source_ami": "ami-0a9d5908c7201e91d",
- "aws_associate_public_ip_address": true,
- "aws_ena_support": true,
+ "aws_associate_public_ip_address": "true",
+ "aws_ena_support": "true",
"ssh_user": "admin",
"subnet_id": "",
"vpc_id": ""
@@ -69,8 +69,7 @@
"playbook_file": "ansible/build-compute-image.yml",
"user": "{{user `ssh_user`}}",
"extra_arguments": [
- "--extra-vars", "@{{ user `ansible_vars_file` }}",
- "--scp-extra-args", "'-O'"
+ "--extra-vars", "@{{ user `ansible_vars_file` }}"
]
}]
}
Apart from the above, this LGTM.
Re: Jenkins updates, I think if we're going to invest time to make Jenkins more friendly, it would be nice to also consider making the pipelines tracked from git instead of using the awful Web UI.
Updated by Brett Smith 2 months ago
Lucas Di Pentima wrote in #note-14:
I've read and followed the packer+ansible documentation and had some issues that I'm not entirely sure were my own fault.
Yup, these are all good comments, thanks. Now at 6ef98cf0768eecd73950ed6028288908b61f2da0 to address all your comments except:
- Had to remove the --scp-extra-args='-O' because debian 11's scp doesn't support it.
This is a good catch, but I can't take your diff as-is because this option is required on my Debian 12 system. The cause is OpenSSH 9.0, which changed scp to use the SFTP protocol under the hood. Versions of Ansible released before that change were written to expect that scp uses the older SCP protocol, so if you're using an older version of Ansible with a newer version of SSH, you must pass this option to get scp to behave the way Ansible expects.
Literally none of this code is under our control (not SSH, not Ansible, not Packer), so I'm low on options for dealing with it. As best I can tell, the best option available is to write this entire explanation in the install documentation, walking users through the process of checking their SSH version and maybe their Ansible version, and then configuring Ansible to always use this option if the combination requires it. I don't think this is very reader-friendly, but I don't know what else to do if we expect users to be able to run this with arbitrary combinations of SSH and Ansible versions, which we've decided is a requirement. Curious if you have any other ideas though.
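For reference, the knob Ansible provides for this is the ssh connection plugin's scp_extra_args setting, which can also be set as an inventory variable instead of being passed through Packer's extra_arguments; where exactly we would tell users to set it is the open question:

# In any vars/inventory file read by the play, e.g. group_vars/all.yml:
# force scp to use the legacy SCP protocol. Needed when the machine
# running Ansible has OpenSSH >= 9.0 but an Ansible release that
# predates the SFTP switch.
ansible_scp_extra_args: "-O"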
Re: Jenkins updates, I think if we're going to invest time to make Jenkins more friendly, it would be nice to also consider making the pipelines tracked from git instead of using the awful Web UI.
You don't have to convince me, I'm totally in favor of this too.
Updated by Lucas Di Pentima 2 months ago
Brett Smith wrote in #note-15:
- Had to remove the --scp-extra-args='-O' because debian 11's scp doesn't support it.
This is a good catch, but I can't take your diff as-is because this option is required on my Debian 12 system. The cause is OpenSSH 9.0, which changed scp to use the SFTP protocol under the hood. Versions of Ansible released before that change were written to expect that scp uses the older SCP protocol, so if you're using an older version of Ansible with a newer version of SSH, you must pass this option to get scp to behave the way Ansible expects.
Literally none of this code is under our control (not SSH, not Ansible, not Packer), so I'm low on options for dealing with it. As best I can tell, the best option available is to write this entire explanation in the install documentation, walking users through the process of checking their SSH version and maybe their Ansible version, and then configuring Ansible to always use this option if the combination requires it. I don't think this is very reader-friendly, but I don't know what else to do if we expect users to be able to run this with arbitrary combinations of SSH and Ansible versions, which we've decided is a requirement. Curious if you have any other ideas though.
I looked and am surprised that scp doesn't support a backwards compatible auto-fallback feature.
I think mentioning the issue in the documentation is a good solution, but the default configuration could be aimed at users who probably use the newest distros, so the -O wouldn't be needed. Am I understanding correctly that you needed it because you have a newer OpenSSH in Debian 12 that needed to deploy to an older Debian 11 base AMI? If so, we can change the base AMI to point to a Debian 12. The probability of users trying the Ansible installer with post-v9.0 OpenSSH should increase as months go by, so IMO that is the best case to support by default.
Updated by Brett Smith about 2 months ago
Lucas Di Pentima wrote in #note-16:
I looked and am surprised that scp doesn't support a backwards compatible auto-fallback feature.
The issue is more subtle than that because the SFTP protocol has been around for a long time, so older servers will have it. But features have been added to the protocol recently, and if the server interprets your path wrong I guess it's hard to tell whether that means "SFTP server is too old" or some other problem. Quoting the man page for -O:
Forcing the use of the SCP protocol may be necessary for servers that do not implement SFTP, for backwards-compatibility for particular filename wildcard patterns and for expanding paths with a ‘~’ prefix for older SFTP servers.
Based on the error message, I think the last point is probably what is tripping up Ansible.
Anyway, now at 3f8701af750d20c9cc19b9935b18c401dc5df634 with your change plus documentation about how to reconfigure Ansible if necessary.
Updated by Brett Smith about 2 months ago
I just pushed 64f0573c9a57d8840a1dcae3814664bc770503f4 which changes how Ansible is configured. Instead of having a few settings that must match the cluster configuration, you just point it at your cluster configuration, and Ansible reads the settings it needs out of there directly.
This feature is one we consider a requirement for a general Ansible installer. Introducing it now demonstrates the technique for future development, and helps administrators avoid situations where the cluster configuration and Ansible configuration get out of sync.
I have tested the new playbook on a local VM with a configuration that makes sure all affected plays are covered. The new playbook code correctly got all the configuration data it was supposed to.
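To sketch the technique (the file path, variable names, and the specific setting below are illustrative, not necessarily what the playbook reads):

# Hypothetical play demonstrating reading settings from config.yml.
- name: Read settings from the Arvados cluster configuration
  hosts: default
  vars:
    arvados_config_file: /etc/arvados/config.yml  # hypothetical override point
    arvados_cluster_id: xxxxx                     # hypothetical cluster ID
    # Parse the whole file once, on the machine running Ansible...
    arvados_config: "{{ lookup('ansible.builtin.file', arvados_config_file) | from_yaml }}"
  tasks:
    # ...then pull individual settings out of the parsed structure.
    - name: Show one setting from the cluster configuration
      ansible.builtin.debug:
        msg: "Controller URL: {{ arvados_config.Clusters[arvados_cluster_id].Services.Controller.ExternalURL }}"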
Updated by Lucas Di Pentima about 2 months ago
I think that using the arvados config file to pass configuration to Ansible will be really useful for avoiding out of sync situations. I have a couple questions about it:
- File tools/compute-images/ansible/build-compute-image.yml, line 24: it seems to me that this doesn't support embedded privkeys in the config file, would it be possible to add that? If not, should it be documented?
- If Ansible is expecting the /etc/arvados/config.yml file to have a path to the privkey, I think it could be a problem because the path should correspond to an arvados node path, and the compute image building script won't always be run in a cluster node (e.g.: a CI worker node). So maybe we can have a prefix variable that gets prepended to this value?
Updated by Brett Smith about 2 months ago
Lucas Di Pentima wrote in #note-19:
- It seems to me that this doesn't support embedded privkeys in the config file, would it be possible to add that? If not, should it be documented?
It does support an embedded privkey. If the value is not a file: URL, it writes the key to ssh-keygen's stdin. Added a comment to explain this in 0a69626c2c56c3af58ab6c4bf1b8d37a2ed14e1d.
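A minimal sketch of that branching (the variable and task names here are illustrative, not the branch's):

# Hypothetical task: derive the public key from embedded key material.
- name: Derive public key from embedded private key
  ansible.builtin.command:
    argv: [ssh-keygen, -y, -f, /dev/stdin]
    stdin: "{{ compute_ssh_private_key }}"  # hypothetical variable
  register: derived_public_key
  changed_when: false
  when: not compute_ssh_private_key.startswith('file:')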
- If Ansible is expecting the /etc/arvados/config.yml file to have a path to the privkey, I think it could be a problem because the path should correspond to an arvados node path, and the compute image building script won't always be run in a cluster node (e.g.: a CI worker node). So maybe we can have a prefix variable that gets prepended to this value?
I had similar thoughts after I pushed. I suspect Tom would have opinions about this, but I guess we can't ask him right now.
In 7db50ebe048a59c489100f3dbf8d556e85dbbb27 we use compute_authorized_keys as a source of additional public keys to install. This meets the need: CI nodes can just have a copy of the public key and use that. It also gives us more flexibility: it lets administrators authorize additional public keys for any reason, and it makes development easier by letting developers just point at their SSH public key without mucking with their test cluster configuration.
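A sketch of how such a variable can be consumed (the login user is a placeholder, and I'm assuming the variable holds the public key text itself):

# Hypothetical task consuming compute_authorized_keys.
- name: Install additional authorized keys for the compute user
  ansible.posix.authorized_key:
    user: crunch  # hypothetical login user
    key: "{{ compute_authorized_keys }}"
  when: compute_authorized_keys is defined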
Updated by Lucas Di Pentima about 2 months ago
Brett Smith wrote in #note-20:
It does support an embedded privkey. If the value is not a file: URL, it writes the key to ssh-keygen's stdin. Added a comment to explain this in 0a69626c2c56c3af58ab6c4bf1b8d37a2ed14e1d.
Thanks, I initially misread the code. The comment helps a lot.
In 7db50ebe048a59c489100f3dbf8d556e85dbbb27 we use compute_authorized_keys as a source of additional public keys to install. This meets the need: CI nodes can just have a copy of the public key and use that. It also gives us more flexibility: it lets administrators authorize additional public keys for any reason, and it makes development easier by letting developers just point at their SSH public key without mucking with their test cluster configuration.
That's a nice addition, LGTM.
Updated by Brett Smith about 2 months ago
- Status changed from In Progress to Resolved
Updated by Brett Smith about 2 months ago
- Precedes Feature #22383: Update packer-build-compute-image Jenkins job to use Ansible added