Bug #22788

closed

Compute image builder fails when using a salt installer-generated arvados config file

Added by Lucas Di Pentima 4 days ago. Updated 3 days ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Target version:
Story points: -
Release relationship: Auto

Description

Deployed a new 3.1.1 cluster with the Salt installer, then ran the compute image build script. In host_config.yml, arvados_config_file pointed to the config.yml file from one of the nodes, which includes the SSH dispatcher privkey, and compute_authorized_keys was left commented out so that the build would use the key provided in config.yml.

It seems the build is trying to use the privkey defined in the config file as a file path: ssh-keygen reports the key's first line as "No such file or directory". Changing the config file's Containers.DispatchPrivateKey entry to the privkey file path fixes the issue.
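To illustrate the two kinds of value involved (key content versus a path or file URL), here is a minimal Python sketch of one way to tell them apart. It is only a heuristic written for this ticket: the function name and the example paths are hypothetical, and nothing here is taken from the playbook or from Arvados itself.

```python
def is_inline_private_key(value: str) -> bool:
    """Guess whether a DispatchPrivateKey value is key content rather than a path or URL.

    Assumption: PEM and OpenSSH private keys begin with a '-----BEGIN ' armor line,
    as the key in the error message above does.
    """
    return value.lstrip().startswith("-----BEGIN ")


# Key content, as in the failing config (body elided here):
print(is_inline_private_key("-----BEGIN OPENSSH PRIVATE KEY-----\n..."))  # True
# A plain path and a file URL (hypothetical locations):
print(is_inline_private_key("/home/user/.ssh/id_dispatcher"))             # False
print(is_inline_private_key("file:///etc/arvados/dispatch.key"))          # False
```

A check like this only decides which branch to take; the key content still has to be written to a file before ssh-keygen can read it.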

    amazon-ebs: TASK [Load Arvados cluster configuration] **************************************
    amazon-ebs: ok: [default]
    amazon-ebs:
    amazon-ebs: TASK [Get Crunch dispatch public key] ******************************************
    amazon-ebs: fatal: [default -> localhost]: FAILED! => {"changed": true, "cmd": ["ssh-keygen", "-y"], "delta": "0:00:00.003571", "end": "2025-04-15 11:41:40.573289", "msg": "non-zero return code", "rc": 255, "start": "2025-04-15 11:41:40.569718", "stderr": "-----BEGIN OPENSSH PRIVATE KEY-----: No such file or directory", "stderr_lines": ["-----BEGIN OPENSSH PRIVATE KEY-----: No such file or directory"], "stdout": "Enter file in which the key is (/home/lucas/.ssh/id_rsa): ", "stdout_lines": ["Enter file in which the key is (/home/lucas/.ssh/id_rsa): "]}
    amazon-ebs:
    amazon-ebs: PLAY RECAP *********************************************************************
    amazon-ebs: default                    : ok=3    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
    amazon-ebs:
==> amazon-ebs: Provisioning step had errors: Running the cleanup provisioner, if present...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...

Files

22788.log (7.31 KB) Brett Smith, 04/15/2025 04:18 PM

Subtasks 1 (0 open, 1 closed)

Task #22790: Review 22788-ansible-key-fix (Resolved, Lucas Di Pentima, 04/15/2025)
Actions #1

Updated by Lucas Di Pentima 4 days ago

Also tried with a config file from arvados-server config-dump, because the privkey formatting is a bit different, but it failed in the same way.

Actions #2

Updated by Brett Smith 4 days ago

22788-ansible-key-fix @ 6ceacf85425f67b372395cdfb01ac593e5c7d5aa - packer-build-compute-image: #328

  • All agreed upon points are implemented / addressed.
    • Yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual; what manual testing was done is described.
    • See above. Tested manually by copying this first play into a separate test playbook, then running it against a test cluster with ansible-playbook -v, with DispatchPrivateKey set to either key content or a file URL. The output from ssh-keygen was the same in both cases.
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • N/A
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.
    • N/A (no Ansible style guide)
Actions #3

Updated by Brett Smith 4 days ago

  • Target version changed from Development 2025-04-30 to Development 2025-04-16
  • Assigned To set to Brett Smith
  • Status changed from New to In Progress
Actions #4

Updated by Brett Smith 4 days ago

  • Subtask #22790 added
Actions #5

Updated by Brett Smith 4 days ago

Unfortunately the tordo build doesn't really exercise this functionality. Since the Jenkins job relies on the public cluster config, it doesn't have the dispatch key there, and instead loads it separately through an Ansible variable. The build shows I didn't add a syntax error or anything, but unfortunately manual testing is all we have to go on. See my log attached. You'll just have to trust I didn't forge it.

Actions #6

Updated by Brett Smith 4 days ago

The build failure is unrelated. It happened because the newest versions of CUDA move files around in a way that breaks our Ansible playbook. We'll need to update it accordingly. Filed #22792.

This generally won't affect production builds because they pin to CUDA 560. tordo does not pin because it's the development cluster where we're specifically looking to shake out issues like this.

Actions #7

Updated by Lucas Di Pentima 4 days ago

I tested with the same arvados config file as before, and I got this error:

    amazon-ebs: TASK [Save dispatch private key to tempfile] ***********************************
    amazon-ebs: changed: [default -> localhost]
    amazon-ebs:
    amazon-ebs: TASK [Derive dispatch public key] **********************************************
    amazon-ebs: fatal: [default -> localhost]: FAILED! => {"changed": true, "cmd": ["ssh-keygen", "-y", "-f", "/tmp/ansible.q6qynbjs.key"], "delta": "0:00:00.001825", "end": "2025-04-15 14:09:26.816157", "msg": "non-zero return code", "rc": 255, "start": "2025-04-15 14:09:26.814332", "stderr": "Load key \"/tmp/ansible.q6qynbjs.key\": invalid format", "stderr_lines": ["Load key \"/tmp/ansible.q6qynbjs.key\": invalid format"], "stdout": "", "stdout_lines": []}
    amazon-ebs:
    amazon-ebs: TASK [Remove dispatch private key tempfile] ************************************
    amazon-ebs: changed: [default -> localhost]
    amazon-ebs:
    amazon-ebs: PLAY RECAP *********************************************************************
    amazon-ebs: default                    : ok=6    changed=3    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

So, I commented out the "Remove dispatch private key tempfile" step and looked at what was being produced as a tempfile. It seemed OK, but I was still getting the same error:

(venv-ansible) lucas@debian11:~/arvados/tools/compute-images$ ssh-keygen -y -f /tmp/ansible.r9yrpm54.key
Load key "/tmp/ansible.r9yrpm54.key": invalid format

Then I tried diffing it against the original:

(venv-ansible) lucas@debian11:~/arvados/tools/compute-images$ diff -Naur /tmp/ansible.r9yrpm54.key ~/.ssh/id_dispatcher
--- /tmp/ansible.r9yrpm54.key    2025-04-15 14:25:38.576163327 -0300
+++ /home/lucas/.ssh/id_dispatcher    2024-02-02 14:16:21.957136273 -0300
@@ -35,4 +35,4 @@
 hEqXXf30DlBHzni3jDlhhwLHRDtBivngMi2ng3CNCtlUPIgh67rKa84MlEq48MikiKwrEu
 LTvsGdIFK4pbFuxzgqDQ44fKx1XP2ibztR0gfgZOLQXeDRHTGybRoB0dhiJ3RAO+fTGeuh
 q1g3F6Kn4ax2gXAAAADmx1Y2FzQGRlYmlhbjExAQIDBA==
------END OPENSSH PRIVATE KEY-----
\ No newline at end of file
+-----END OPENSSH PRIVATE KEY-----

So, I added a final newline to the tempfile and it started working:

(venv-ansible) lucas@debian11:~/arvados/tools/compute-images$ ssh-keygen -y -f /tmp/ansible.r9yrpm54.key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDNe6NsZsW49XEkeMEjLeoQU7tp36EC31CTAMDOhBDSQbXiK3LbhuLG+25ytL9vhwNFc8MAiEJeiKJoWKwr6bQ1KoSF7rw9/Y4KIScLj/61+WuhxkeGFcbsbzqlM2bwOrtw4zJ+H0jZt5Sivfwa1fZx8MqhOSaw/Yl1Xa2Z3rn0SrV1HNi0DJrv/eQxQmEJSs26WkF8XYixd1o5eRDaYNuJu93G5l8vSpCAJHA/L0XMLXgJVtldmAqxRtpQotTFRy2GZqVA1x6KHWdxdetbCYTABWsmWseFQCGTKzpe3G8vzuuKmODq9rgrr59OBpmGr+wxTksawDuzIoYmaeOqOWVfCygi86168DQ+F7tUybtITHdnR+7L4kbj4aZDPgqHyzSo/QTHRbOWEn9XnF4vnweYcOAFTndgvtkca226dth1Ni6N9PIfdcXVQJeuI9ETWu8TqrncgS5SaefnxhgENPRXDh/L6jmdqs0pJ2wKwQW+rIraM++Isl9KkN7595Sw+zc= lucas@debian11
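For anyone repeating this check without reading a diff, here is a minimal Python sketch of the condition the diff flagged with "\ No newline at end of file": whether the file's last byte is a newline. The helper name ends_with_newline is made up for illustration and is not part of the playbook.

```python
import os


def ends_with_newline(path: str) -> bool:
    """Return True if the file's last byte is a newline. Empty files return False."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False
        # Step back one byte from the end and read it.
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"
```

Running it against the tempfile left behind by the playbook (with the cleanup step commented out, as above) would be expected to return False before the manual fix and True after it.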

Not sure why it worked for you. I tested against both a Salt installer-created config.yml and the complete version from arvados-server config-dump. Maybe there's a behavior difference between ssh client versions? The one I have in my test VM is 1:8.4p1-5+deb11u3.

Actions #8

Updated by Brett Smith 4 days ago

Lucas Di Pentima wrote in #note-7:

Not sure why it worked for you. I tested against both a Salt installer-created config.yml and the complete version from arvados-server config-dump. Maybe there's a behavior difference between ssh client versions?

It is literally a one-character difference. I wrote my config with a | block in the YAML. The Salt installer (apparently) writes it with |-, which removes the trailing newline, which causes ssh-keygen to reject it as invalid, because the OpenSSH developers hate me, personally, and want me to suffer.

Both versions work if we ensure there's a trailing newline. Now at 5fd714ba82d13dc986162e5a8bd31d31b71aa563
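A minimal sketch of the behavior described above, written in Python rather than Ansible, and not the code from the commit. In YAML, a | block scalar keeps the final newline and |- strips it, so normalizing the key text before writing the tempfile makes both forms equivalent. All function names are hypothetical; the ssh-keygen -y -f invocation matches the command shown in the log in note-7.

```python
import os
import subprocess
import tempfile


def ensure_trailing_newline(key_text: str) -> str:
    """Return key_text unchanged if it ends with a newline, else append one."""
    return key_text if key_text.endswith("\n") else key_text + "\n"


def save_key_to_tempfile(key_text: str) -> str:
    """Write the normalized key to a new temp file and return its path.

    mkstemp creates the file readable and writable only by the current user.
    """
    fd, path = tempfile.mkstemp(suffix=".key")
    with os.fdopen(fd, "w") as f:
        f.write(ensure_trailing_newline(key_text))
    return path


def derive_public_key(key_text: str) -> str:
    """Derive the public key with ssh-keygen, always removing the temp file."""
    path = save_key_to_tempfile(key_text)
    try:
        result = subprocess.run(
            ["ssh-keygen", "-y", "-f", path],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()
    finally:
        os.remove(path)
```

With this normalization, a key loaded from a | scalar and the same key loaded from a |- scalar produce byte-identical tempfiles, which matches the observation that both versions work once the trailing newline is guaranteed.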

Actions #9

Updated by Lucas Di Pentima 4 days ago

Brett Smith wrote in #note-8:

Both versions work if we ensure there's a trailing newline. Now at 5fd714ba82d13dc986162e5a8bd31d31b71aa563

This LGTM, thanks!

Actions #10

Updated by Peter Amstutz 3 days ago

  • Target version changed from Development 2025-04-16 to Development 2025-04-30
Actions #11

Updated by Brett Smith 3 days ago

  • Status changed from In Progress to Resolved