Bug #22788
closed
Compute image builder fails when using a salt installer-generated arvados config file
Description
Deployed a new 3.1.1 cluster with the salt installer. Then, ran the compute image build script passing the config.yml file from one of the nodes that includes the SSH dispatcher privkey as arvados_config_file in host_config.yml, also leaving compute_authorized_keys commented out so that it can use the one provided with the config.yml file.
It seems that it's trying to use the privkey defined in the config file as a file path. Changing the config file's Containers.DispatchPrivateKey entry to the privkey file path fixes the issue.
amazon-ebs: TASK [Load Arvados cluster configuration] **************************************
amazon-ebs: ok: [default]
amazon-ebs:
amazon-ebs: TASK [Get Crunch dispatch public key] ******************************************
amazon-ebs: fatal: [default -> localhost]: FAILED! => {"changed": true, "cmd": ["ssh-keygen", "-y"], "delta": "0:00:00.003571", "end": "2025-04-15 11:41:40.573289", "msg": "non-zero return code", "rc": 255, "start": "2025-04-15 11:41:40.569718", "stderr": "-----BEGIN OPENSSH PRIVATE KEY-----: No such file or directory", "stderr_lines": ["-----BEGIN OPENSSH PRIVATE KEY-----: No such file or directory"], "stdout": "Enter file in which the key is (/home/lucas/.ssh/id_rsa): ", "stdout_lines": ["Enter file in which the key is (/home/lucas/.ssh/id_rsa): "]}
amazon-ebs:
amazon-ebs: PLAY RECAP *********************************************************************
amazon-ebs: default : ok=3 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
amazon-ebs:
==> amazon-ebs: Provisioning step had errors: Running the cleanup provisioner, if present...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
Files
Updated by Lucas Di Pentima 4 days ago
Also tried with a config file from arvados-server config-dump, because the privkey formatting is a bit different, but it failed in the same way.
Updated by Brett Smith 4 days ago
22788-ansible-key-fix @ 6ceacf85425f67b372395cdfb01ac593e5c7d5aa - packer-build-compute-image: #328
- All agreed upon points are implemented / addressed.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
- See above. Tested manually by copying this first play into a separate test playbook, then running it against a test cluster with ansible-playbook -v, with DispatchPrivateKey set to either key content or a file URL. The output from ssh-keygen was the same in both cases.
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- N/A
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- N/A (no Ansible style guide)
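The manual test described in the checklist above used two forms of DispatchPrivateKey: inline key content and a file URL. A minimal Python sketch of resolving either form to the same key text could look like the following. The function name resolve_dispatch_key is hypothetical and is not part of Arvados or the playbook. It only illustrates why both inputs should produce identical ssh-keygen output.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def resolve_dispatch_key(value: str) -> str:
    """Return private key text for a DispatchPrivateKey-style value.

    Hypothetical helper: a value starting with "file://" is treated as a
    URL pointing at a key file; anything else is treated as inline key
    content and returned unchanged.
    """
    if value.startswith("file://"):
        # Convert the URL path component into a local filesystem path.
        path = url2pathname(urlparse(value).path)
        return Path(path).read_text()
    return value
```

Calling it with the key text itself, or with a file URL for a file holding that text, returns the same string.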
Updated by Brett Smith 4 days ago
- Target version changed from Development 2025-04-30 to Development 2025-04-16
- Assigned To set to Brett Smith
- Status changed from New to In Progress
Updated by Brett Smith 4 days ago
Unfortunately the tordo build doesn't really exercise this functionality. Since the Jenkins job relies on the public cluster config, it doesn't have the dispatch key there, and instead loads it separately through an Ansible variable. The build shows I didn't add a syntax error or anything, but unfortunately manual testing is all we have to go on. See my log attached. You'll just have to trust I didn't forge it.
Updated by Brett Smith 4 days ago
The build failure is unrelated. It happened because the newest versions of CUDA move files around in a way that breaks our Ansible playbook. We'll need to update it accordingly. Filed #22792.
This generally won't affect production builds because they pin to CUDA 560. tordo does not pin because it's the development cluster where we're specifically looking to shake out issues like this.
Updated by Lucas Di Pentima 4 days ago
I tested with the same arvados config file as before, and I got this error:
amazon-ebs: TASK [Save dispatch private key to tempfile] ***********************************
amazon-ebs: changed: [default -> localhost]
amazon-ebs:
amazon-ebs: TASK [Derive dispatch public key] **********************************************
amazon-ebs: fatal: [default -> localhost]: FAILED! => {"changed": true, "cmd": ["ssh-keygen", "-y", "-f", "/tmp/ansible.q6qynbjs.key"], "delta": "0:00:00.001825", "end": "2025-04-15 14:09:26.816157", "msg": "non-zero return code", "rc": 255, "start": "2025-04-15 14:09:26.814332", "stderr": "Load key \"/tmp/ansible.q6qynbjs.key\": invalid format", "stderr_lines": ["Load key \"/tmp/ansible.q6qynbjs.key\": invalid format"], "stdout": "", "stdout_lines": []}
amazon-ebs:
amazon-ebs: TASK [Remove dispatch private key tempfile] ************************************
amazon-ebs: changed: [default -> localhost]
amazon-ebs:
amazon-ebs: PLAY RECAP *********************************************************************
amazon-ebs: default : ok=6 changed=3 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
So, I commented out the "Remove dispatch private key tempfile" step and looked at what it was producing as a tempfile. It seemed OK, but I was still getting the same error:
(venv-ansible) lucas@debian11:~/arvados/tools/compute-images$ ssh-keygen -y -f /tmp/ansible.r9yrpm54.key
Load key "/tmp/ansible.r9yrpm54.key": invalid format
Then I tried diffing it against the original:
(venv-ansible) lucas@debian11:~/arvados/tools/compute-images$ diff -Naur /tmp/ansible.r9yrpm54.key ~/.ssh/id_dispatcher
--- /tmp/ansible.r9yrpm54.key 2025-04-15 14:25:38.576163327 -0300
+++ /home/lucas/.ssh/id_dispatcher 2024-02-02 14:16:21.957136273 -0300
@@ -35,4 +35,4 @@
 hEqXXf30DlBHzni3jDlhhwLHRDtBivngMi2ng3CNCtlUPIgh67rKa84MlEq48MikiKwrEu
 LTvsGdIFK4pbFuxzgqDQ44fKx1XP2ibztR0gfgZOLQXeDRHTGybRoB0dhiJ3RAO+fTGeuh
 q1g3F6Kn4ax2gXAAAADmx1Y2FzQGRlYmlhbjExAQIDBA==
------END OPENSSH PRIVATE KEY-----
\ No newline at end of file
+-----END OPENSSH PRIVATE KEY-----
So, I added a final newline to the tempfile and it started working:
(venv-ansible) lucas@debian11:~/arvados/tools/compute-images$ ssh-keygen -y -f /tmp/ansible.r9yrpm54.key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDNe6NsZsW49XEkeMEjLeoQU7tp36EC31CTAMDOhBDSQbXiK3LbhuLG+25ytL9vhwNFc8MAiEJeiKJoWKwr6bQ1KoSF7rw9/Y4KIScLj/61+WuhxkeGFcbsbzqlM2bwOrtw4zJ+H0jZt5Sivfwa1fZx8MqhOSaw/Yl1Xa2Z3rn0SrV1HNi0DJrv/eQxQmEJSs26WkF8XYixd1o5eRDaYNuJu93G5l8vSpCAJHA/L0XMLXgJVtldmAqxRtpQotTFRy2GZqVA1x6KHWdxdetbCYTABWsmWseFQCGTKzpe3G8vzuuKmODq9rgrr59OBpmGr+wxTksawDuzIoYmaeOqOWVfCygi86168DQ+F7tUybtITHdnR+7L4kbj4aZDPgqHyzSo/QTHRbOWEn9XnF4vnweYcOAFTndgvtkca226dth1Ni6N9PIfdcXVQJeuI9ETWu8TqrncgS5SaefnxhgENPRXDh/L6jmdqs0pJ2wKwQW+rIraM++Isl9KkN7595Sw+zc= lucas@debian11
Not sure why it worked for you, I tested against both a salt installer-created config.yml and the complete version from arvados-server config-dump, maybe there's a behavior difference between ssh client versions? The one I have in my test VM is 1:8.4p1-5+deb11u3
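The workaround above (append a final newline to the tempfile, then run ssh-keygen -y -f on it) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the playbook's actual tasks: the function names write_key_tempfile and derive_public_key are hypothetical, and derive_public_key assumes an OpenSSH ssh-keygen binary is on PATH.

```python
import os
import subprocess
import tempfile


def write_key_tempfile(key_text: str) -> str:
    """Write key text to a tempfile so it ends in exactly one newline."""
    # Drop any trailing newlines, then add back a single one.
    normalized = key_text.rstrip("\n") + "\n"
    # mkstemp creates the file readable and writable only by its owner.
    fd, path = tempfile.mkstemp(suffix=".key")
    with os.fdopen(fd, "w") as f:
        f.write(normalized)
    return path


def derive_public_key(key_text: str) -> str:
    """Derive the public key by running `ssh-keygen -y -f` on a tempfile."""
    path = write_key_tempfile(key_text)
    try:
        result = subprocess.run(
            ["ssh-keygen", "-y", "-f", path],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()
    finally:
        # Always remove the private key copy, even if ssh-keygen fails.
        os.remove(path)
```

Normalizing before writing means the same code path handles key text with or without a trailing newline.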
Updated by Brett Smith 4 days ago
Lucas Di Pentima wrote in #note-7:
Not sure why it worked for you, I tested against both a salt installer-created config.yml and the complete version from arvados-server config-dump, maybe there's a behavior difference between ssh client versions?
It is literally a one-character difference. I wrote my config with a | block in the YAML. The Salt installer (apparently) writes it with |-, which removes the trailing newline, which causes ssh-keygen to reject it as invalid, because the OpenSSH developers hate me, personally, and want me to suffer.
Both versions work if we ensure there's a trailing newline. Now at 5fd714ba82d13dc986162e5a8bd31d31b71aa563
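To illustrate the | versus |- difference described above, here is a small Python model of YAML block-scalar chomping. It is a simplified sketch, not a YAML parser: chomp_block_scalar is a hypothetical name, and it assumes the literal text has already been dedented and that every content line ends with a line break.

```python
def chomp_block_scalar(text: str, indicator: str = "") -> str:
    """Apply YAML block-scalar chomping to already-dedented literal text.

    indicator "" models clip (`|`), "-" models strip (`|-`), and "+"
    models keep (`|+`).
    """
    body = text.rstrip("\n")
    trailing = text[len(body):]
    if indicator == "-":
        # Strip: drop the final line break and any trailing empty lines.
        return body
    if indicator == "+":
        # Keep: preserve the final line break and trailing empty lines.
        return body + trailing
    # Clip: keep a single final line break when there is any content.
    return body + "\n" if body else ""
```

With this model, the same key text ends in a newline under | and has none under |-, which matches the one-character difference that ssh-keygen rejected.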
Updated by Lucas Di Pentima 4 days ago
Brett Smith wrote in #note-8:
Both versions work if we ensure there's a trailing newline. Now at 5fd714ba82d13dc986162e5a8bd31d31b71aa563
This LGTM, thanks!
Updated by Peter Amstutz 3 days ago
- Target version changed from Development 2025-04-16 to Development 2025-04-30