Bug #22217

closed

Compute node build script fails to SSH into more recent Debian/Ubuntu

Added by Brett Smith 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Target version:
Story points: -

Description

This might be as simple as one misconfiguration, but if you take our current compute node build script and try to build an image on top of a Debian 12 AMI, it never manages to SSH in to start the build work. This error repeats until it gives up:

2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: Using host value: 54.175.47.1
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [INFO] Attempting SSH connection to 54.175.47.1:22...
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] reconnecting to TCP connection for SSH
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] handshaking with SSH
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] SSH handshake err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
2024/10/18 01:27:58 packer-builder-amazon-ebs plugin: [DEBUG] Detected authentication error. Increasing handshake attempts.
==> amazon-ebs: Error waiting for SSH: Packer experienced an authentication error when trying to connect via SSH. This can happen if your username/password are wrong. You may want to double-check your credentials as part of your debugging process. original error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

I've checked that admin is still the correct username at least.


Subtasks 1 (0 open, 1 closed)

Task #22251: Review 22217-packer-keypair-type (Resolved, Brett Smith, 11/12/2024)

Related issues 1 (1 open, 0 closed)

Precedes Arvados - Feature #22315: Compute node build script supports Ubuntu AMIs without modification (New)
Actions #2

Updated by Brett Smith 3 months ago

Same problem happens with Ubuntu 22.04 using the ubuntu user.

Actions #3

Updated by Peter Amstutz 3 months ago

  • Target version set to Development 2024-11-06 sprint

Look into this post-3.0, candidate for a 3.0.1 fix.

Actions #4

Updated by Lucas Di Pentima 3 months ago

  • Assigned To set to Lucas Di Pentima
Actions #5

Updated by Peter Amstutz 3 months ago

  • Assigned To changed from Lucas Di Pentima to Brett Smith
Actions #6

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2024-11-06 sprint to Development 2024-11-20
Actions #7

Updated by Brett Smith 3 months ago

  • Status changed from New to In Progress

First guess is that we're using a key type that is no longer supported by default in modern OpenSSH. For example, see the announcement for OpenSSH 8.8 (which disables RSA signatures using SHA-1 by default) in /usr/share/doc/openssh-client/NEWS.Debian.gz. However, I'm still following the thread to figure out what key we're even using for the initial connection to be able to confirm that.

The good news is if I'm right about this it really is just a local configuration issue, not a code problem.
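
To make the guess above concrete: from OpenSSH 8.8 on, a server with default settings refuses RSA signatures that use SHA-1, so a client that can only sign with the legacy `ssh-rsa` algorithm can fail public key authentication. A minimal Python sketch for checking which algorithm a public key line names is shown here. The function names and the sample key strings are hypothetical placeholders, not part of the Arvados build scripts, and the check only flags a candidate for closer inspection.

```python
# Algorithm name that is affected by the OpenSSH 8.8 default change when the
# client can only produce SHA-1 signatures for it.
LEGACY_RSA = "ssh-rsa"


def key_algorithm(public_key_line: str) -> str:
    """Return the algorithm field of an OpenSSH-format public key line."""
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<algorithm> <base64-key> [comment]'")
    return fields[0]


def uses_legacy_rsa(public_key_line: str) -> bool:
    """True if the key is an RSA key, which is worth a closer look."""
    return key_algorithm(public_key_line) == LEGACY_RSA


# Hypothetical, truncated sample keys for illustration only.
RSA_SAMPLE = "ssh-rsa AAAAB3NzaC1yc2E packer-temp"
ED25519_SAMPLE = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5 packer-temp"

for sample in (RSA_SAMPLE, ED25519_SAMPLE):
    print(key_algorithm(sample), uses_legacy_rsa(sample))
```

An RSA key is not unusable in itself: a client that supports the `rsa-sha2-256` or `rsa-sha2-512` signature algorithms can still authenticate with it. The sketch only identifies keys where that client support matters.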

Actions #8

Updated by Brett Smith 2 months ago

packer-build-compute-image: #279 shows a branch that successfully SSHes into debian12, so I correctly diagnosed and addressed the problem. Now there are unrelated compute node build problems in debian12 to fix.

Actions #9

Updated by Brett Smith 2 months ago

22217-packer-keypair-type @ 053d1aa91e0340eb275685b9ded8466a773a49b8

I have used this branch to successfully build AMIs based on both Debian 11 and 12. Packer errored out during the "waiting for AMI to become available" step, but the AMIs are available and usable in our account, so whatever the problem was isn't fatal.

Debian 11 AMI build: packer-build-compute-image: #289 - ami-03a1bd160ef4e8889

Debian 12 AMI build: packer-build-compute-image: #288 - ami-0a7d80ba0826f802e

I started testing based on Ubuntu, but I have reason to believe these scripts have never been used to build Ubuntu-based images, at least not without modification. source:tools/compute-images/arvados-images-aws.json configures block devices with names following /dev/xvd_ style, but the base Ubuntu AMIs use /dev/sdaN style. The current names in that JSON mean we don't reconfigure the root device as intended on Ubuntu, and the build runs out of space. You can at least see from packer-build-compute-image: #293 that the branch gets far enough to start installing NVIDIA packages, so there's no regression up to that point.
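
The device naming mismatch described above can be sketched as a small check. This is an illustration only: the mapping contents, the volume size, and the two root device names are hypothetical examples of the `/dev/xvd_` and `/dev/sdaN` styles, and the actual contents of arvados-images-aws.json are not reproduced here. It assumes the template's block device mappings are dictionaries with a `device_name` key, as in Packer's Amazon builder configuration.

```python
def mappings_cover_root(mappings: list, root_device_name: str) -> bool:
    """True if any block device mapping names exactly the AMI's root device.

    If no mapping matches, the root volume keeps the base AMI's settings and
    the mapping describes some other device instead.
    """
    return any(m.get("device_name") == root_device_name for m in mappings)


# Hypothetical mapping in the /dev/xvd_ style.
template_mappings = [{"device_name": "/dev/xvda", "volume_size": 20}]

# Hypothetical root device names for the two naming styles.
print(mappings_cover_root(template_mappings, "/dev/xvda"))  # True
print(mappings_cover_root(template_mappings, "/dev/sda1"))  # False
```

A `False` result corresponds to the Ubuntu case reported here: the root device is never resized, so the build runs out of space.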

The key commits that actually fix bugs are 33bc2174d9f6b6d27620a812c21507b8d0a8a9ae, 8bbfc28b363baca9d0dd2fec24e301e2ebaeb311, 55511a45a26a8678bb0348fd4f1ed0b1aa5a5691, and 348fe2270fa686cbb57bae365829d5c340df0cac. Everything else is clean-up I did along the way to make the script more maintainable or easier to run manually in a debugging environment, to help me find and fix the real bugs. But I see no reason not to keep them.

  • All agreed upon points are implemented / addressed.
    • Yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • If we want Ubuntu support out of the box we can make a follow-up story
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • I've only tested that the builds succeed. If we want to test that these AMIs can do compute work, that will be additional work by hand. I am willing to do it but I would like someone to tell me what the expected testing matrix is. Are there particular workflows that need to pass? Do I need to test a workflow that uses CUDA? Do I need to test both Docker and Singularity?
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • No change in scale. Marginally improves build time by reducing the number of apt-get invocations.
  • Considered backwards and forwards compatibility issues between client and server.
    • Only improves forwards distro compatibility without breaking backwards compatibility.
  • Follows our coding standards and GUI style guidelines.
    • N/A (no relevant style guide)
Actions #10

Updated by Brett Smith 2 months ago

  • Precedes Feature #22315: Compute node build script supports Ubuntu AMIs without modification added
Actions #11

Updated by Lucas Di Pentima 2 months ago

Just one suggestion: line 112 of the base.sh script could benefit from using download_and_install().

Re: AMI building pipeline failure due to timeouts, it seems that we could try customizing the waiting. I haven't been able to see what our current values are, but it seems like they're set up to time out after 30 minutes.
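
As a rough sketch of what customizing the waiting would mean: Packer's Amazon builders accept an `aws_polling` setting with `delay_seconds` and `max_attempts`, and the total wait is approximately their product. The values below are hypothetical, chosen only to illustrate a 30 minute budget, and are not taken from our configuration.

```python
def polling_budget_seconds(delay_seconds: int, max_attempts: int) -> int:
    """Approximate upper bound on how long polling waits before giving up."""
    return delay_seconds * max_attempts


# Hypothetical values, not our current settings.
aws_polling = {"delay_seconds": 30, "max_attempts": 60}

budget = polling_budget_seconds(**aws_polling)
print(budget / 60)  # 30.0 (minutes)
```

Raising either value extends the wait. Which one to change depends on whether we care more about how often the API is polled or about the total time allowed.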

I've read that to minimize AMI building time, we could use a layered approach where we build AMIs with less frequently updated software on the lower layers, and probably CUDA would be an ideal candidate for this (it's taking several GiBs). Just a comment to be considered for future work.

Other than the above suggestion, this LGTM.

Actions #12

Updated by Brett Smith 2 months ago

  • Status changed from In Progress to Resolved
Actions #13

Updated by Brett Smith 2 months ago

Lucas Di Pentima wrote in #note-11:

Re: AMI building pipeline failure due to timeouts, it seems that we could try customizing the waiting. I haven't been able to see what our current values are, but it seems like they're set up to time out after 30 minutes.

For what we're doing specifically, I don't see any reason why we need to wait for the AMI to become available at all. As long as the build succeeds so that it will become available eventually, I think that's sufficient for our purposes. It's not like we have some automation immediately following that requires the AMI to be available at that point.
