Bug #22217
Closed
Compute node build script fails to SSH into more recent Debian/Ubuntu
Description
This might be as simple as one misconfiguration, but if you take our current compute node build script and try to build an image on top of a Debian 12 AMI, it never manages to SSH in to start the build work. This error repeats until it gives up:
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: Using host value: 54.175.47.1
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [INFO] Attempting SSH connection to 54.175.47.1:22...
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] reconnecting to TCP connection for SSH
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] handshaking with SSH
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] SSH handshake err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] Detected authentication error. Increasing handshake attempts.
==> amazon-ebs: Error waiting for SSH: Packer experienced an authentication error when trying to connect via SSH. This can happen if your username/password are wrong. You may want to double-check your credentials as part of your debugging process. original error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
I've checked that admin is still the correct username at least.
Updated by Brett Smith about 1 month ago
- Subject changed from Compute node build script fails to SSH into Debian 12 to Compute node build script fails to SSH into more recent Debian/Ubuntu
- File packer-build-compute-image-267.log added
Same problem happens with Ubuntu 22.04 using the ubuntu user.
Updated by Peter Amstutz about 1 month ago
- Target version set to Development 2024-11-06 sprint
Look into this post-3.0, candidate for a 3.0.1 fix.
Updated by Peter Amstutz 29 days ago
- Assigned To changed from Lucas Di Pentima to Brett Smith
Updated by Peter Amstutz 15 days ago
- Target version changed from Development 2024-11-06 sprint to Development 2024-11-20
Updated by Brett Smith 13 days ago
- Status changed from New to In Progress
First guess is that we're using a key type that is no longer supported by default in modern OpenSSH; see, e.g., the announcement for OpenSSH 8.8 in /usr/share/doc/openssh-client/NEWS.Debian.gz. However, I'm still following the thread to figure out what key we're even using for the initial connection so I can confirm that.
The good news is that if I'm right about this, it really is just a local configuration issue, not a code problem.
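For reference, the kind of change this diagnosis points at is having Packer generate its temporary key pair with a type that current OpenSSH servers still accept by default. A minimal sketch against the amazon-ebs builder, assuming the Packer Amazon plugin's temporary_key_pair_type option is available in the version we run (the option name is from the upstream plugin docs; the value here is illustrative, not necessarily what the fix branch does):

    {
      "builders": [{
        "type": "amazon-ebs",
        "ssh_username": "admin",
        "temporary_key_pair_type": "ed25519"
      }]
    }

An ed25519 key avoids relying on the ssh-rsa (SHA-1) signature algorithm that OpenSSH 8.8+ disables by default, which would explain the "no supported methods remain" failure in the log above.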
Updated by Brett Smith 13 days ago
packer-build-compute-image: #279 shows a branch that successfully SSHes into debian12, so I correctly diagnosed and addressed the problem. Now there are unrelated compute node build problems in debian12 to fix.
Updated by Brett Smith 10 days ago
22217-packer-keypair-type @ 053d1aa91e0340eb275685b9ded8466a773a49b8
I have used this branch to successfully build AMIs based on both Debian 11 and 12. Packer errored out during the "waiting for AMI to become available" step, but the AMIs are available and usable in our account, so whatever the problem was isn't fatal.
Debian 11 AMI build: packer-build-compute-image: #289 - ami-03a1bd160ef4e8889
Debian 12 AMI build: packer-build-compute-image: #288 - ami-0a7d80ba0826f802e
I started testing based on Ubuntu, but I have reason to believe these scripts have never been used to build Ubuntu-based images, at least not without modification. source:tools/compute-images/arvados-images-aws.json configures block devices with names following the /dev/xvd_ style, but the base Ubuntu AMIs use the /dev/sdaN style. The current names in that JSON mean we don't reconfigure the root device as intended on Ubuntu, and the build runs out of space. You can at least see from packer-build-compute-image: #293 that the branch gets far enough to start installing NVIDIA packages, so there's no regression up to that point.
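To make the device-name point concrete: a block device mapping only reconfigures the root volume if its device_name matches the base AMI's actual root device. A hypothetical sketch of the relevant amazon-ebs stanza (names and sizes are illustrative, not copied from arvados-images-aws.json):

    "launch_block_device_mappings": [{
      "device_name": "/dev/xvda",
      "volume_size": 100,
      "volume_type": "gp3",
      "delete_on_termination": true
    }]

If the base AMI's root device is named /dev/sda1 instead, a mapping keyed on /dev/xvda doesn't enlarge the root volume, so the build proceeds with the AMI's default root size and eventually runs out of space, as described above.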
The key commits that actually fix bugs are 33bc2174d9f6b6d27620a812c21507b8d0a8a9ae, 8bbfc28b363baca9d0dd2fec24e301e2ebaeb311, 55511a45a26a8678bb0348fd4f1ed0b1aa5a5691, and 348fe2270fa686cbb57bae365829d5c340df0cac. Everything else is clean-up I did along the way to make the script more maintainable or easier to run manually in a debugging environment, to help me find and fix the real bugs. But I see no reason not to keep them.
- All agreed upon points are implemented / addressed.
  - Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - If we want Ubuntu support out of the box we can make a follow-up story.
- Code is tested and passing, both automated and manual, what manual testing was done is described.
  - I've only tested that the builds succeed. If we want to test that these AMIs can do compute work, that will be additional work by hand. I am willing to do it, but I would like someone to tell me what the expected testing matrix is. Are there particular workflows that need to pass? Do I need to test a workflow that uses CUDA? Do I need to test both Docker and Singularity?
- Documentation has been updated.
  - N/A
- Behaves appropriately at the intended scale (describe intended scale).
  - No change in scale. Marginally improves build time by reducing the number of apt-get invocations.
- Considered backwards and forwards compatibility issues between client and server.
  - Only improves forwards distro compatibility without breaking backwards compatibility.
- Follows our coding standards and GUI style guidelines.
  - N/A (no relevant style guide)
Updated by Brett Smith 9 days ago
- Precedes Feature #22315: Compute node build script supports Ubuntu AMIs without modification added
Updated by Lucas Di Pentima 9 days ago
Just one suggestion: the base.sh script, at line 112, could benefit from using download_and_install().
Re: the AMI building pipeline failure due to timeouts, it seems we could try customizing the wait. I haven't been able to see what our current values are; it looks like they're set up to time out after 30 minutes.
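If we do decide to tune that wait, the knob I would expect to use is the Amazon plugin's polling configuration; a hedged sketch, assuming the aws_polling option exists in the Packer version we run (the values are made up for illustration):

    "aws_polling": {
      "delay_seconds": 30,
      "max_attempts": 120
    }

That would poll every 30 seconds for up to an hour before giving up on the "waiting for AMI to become available" step.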
I've read that to minimize AMI building time, we could use a layered approach where we build AMIs with less frequently updated software on the lower layers, and probably CUDA would be an ideal candidate for this (it's taking several GiBs). Just a comment to be considered for future work.
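Purely as a hypothetical illustration of that layering (not something the current branch does): a first template could bake slow-moving pieces like CUDA into a base AMI, and the compute-image template could then pick that AMI up as its source, e.g. with a filter on an AMI name we control such as arvados-compute-cuda-base-*:

    "source_ami_filter": {
      "filters": {
        "name": "arvados-compute-cuda-base-*",
        "root-device-type": "ebs",
        "virtualization-type": "hvm"
      },
      "owners": ["self"],
      "most_recent": true
    }

The frequently rebuilt layer would then only need to install Arvados packages and configuration on top of that base.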
Other than the above suggestion, this LGTM.
Updated by Brett Smith 9 days ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|829f6e86cc1536d3b2c02ea59a3c6effd61fc8c8.
Updated by Brett Smith 9 days ago
Lucas Di Pentima wrote in #note-11:
Re: the AMI building pipeline failure due to timeouts, it seems we could try customizing the wait. I haven't been able to see what our current values are; it looks like they're set up to time out after 30 minutes.
For what we're doing specifically, I don't see any reason why we need to wait for the AMI to become available at all. As long as the build succeeds so that it will become available eventually, I think that's sufficient for our purposes. It's not like we have some automation immediately following that requires the AMI to be available at that point.