Feature #22563
compute node ansible playbook to install ROCm (Status: Closed)
Added by Peter Amstutz about 1 month ago. Updated 19 days ago.
Description
This is pretty straightforward: get the package signing key, set up two third-party Debian repositories (one for the driver, one for the ROCm tools), then install the "amdgpu-dkms" and "rocm" packages.
Apparently each ROCm version gets its own package, so "rocm" is actually just a metapackage pointing to the latest, which is called "rocmX.Y.Z" (e.g. "rocm6.3.2").
(The stated reason is to support installing multiple versions of ROCm for testing).
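As a rough Ansible sketch of those steps (assuming a Debian 12 host using AMD's "jammy" apt suite; the variable names, key path, and repository URLs are illustrative guesses based on AMD's published layout, not the actual playbook):

# Sketch only: names and paths here are assumptions, not the Arvados role.
- name: Install AMD GPU driver and ROCm
  hosts: compute
  become: true
  vars:
    rocm_version: "6.3.2"   # each ROCm release gets its own repository path
    amd_suite: jammy        # apt suite AMD publishes for the host OS
  tasks:
    - name: Ensure the apt keyring directory exists
      ansible.builtin.file:
        path: /etc/apt/keyrings
        state: directory
        mode: "0755"

    - name: Fetch AMD's package signing key
      ansible.builtin.get_url:
        url: https://repo.radeon.com/rocm/rocm.gpg.key
        dest: /etc/apt/keyrings/rocm.asc
        mode: "0644"

    - name: Configure the amdgpu (driver) repository
      ansible.builtin.apt_repository:
        filename: amdgpu
        repo: >-
          deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.asc]
          https://repo.radeon.com/amdgpu/{{ rocm_version }}/ubuntu
          {{ amd_suite }} main

    - name: Configure the ROCm tools repository
      ansible.builtin.apt_repository:
        filename: rocm
        repo: >-
          deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.asc]
          https://repo.radeon.com/rocm/apt/{{ rocm_version }}
          {{ amd_suite }} main

    - name: Install the driver and the ROCm metapackage
      ansible.builtin.apt:
        name:
          - amdgpu-dkms
          - rocm
        update_cache: true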
Files
amdrocm.log (535 KB), Brett Smith, 02/17/2025 06:37 PM
Updated by Peter Amstutz about 1 month ago
- Position changed from -937627 to -937625
Updated by Peter Amstutz about 1 month ago
- Subject changed from compute node ansible can install ROCm to compute node ansible playbook to install ROCm
Updated by Peter Amstutz about 1 month ago
- Related to Support #22562: Test running CUDA tordo with updated pins added
Updated by Brett Smith about 1 month ago
- Assigned To set to Brett Smith
- Status changed from New to In Progress
I have started a branch for this, but note I'm building it on top of the branch for #22489 since that has a lot of Ansible churn and I don't want to write merge conflicts against myself.
Please let me know what version of ROCm you've been testing so I can pin that.
Updated by Peter Amstutz about 1 month ago
Brett Smith wrote in #note-6:
I have started a branch for this, but note I'm building it on top of the branch for #22489 since that has a lot of Ansible churn and I don't want to write merge conflicts against myself.
Please let me know what version of ROCm you've been testing so I can pin that.
Looks like rocm-6.3.0
Updated by Peter Amstutz about 1 month ago
- Related to Feature #21926: AMD ROCm GPU support added
Updated by Peter Amstutz 29 days ago
lspci output from the test machine:
Slot: 00:00.0
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Root Complex
SVendor: Unknown vendor 1f4c
SDevice: Device b016
Slot: 00:00.2
Class: IOMMU
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix IOMMU
SVendor: Unknown vendor 1f4c
SDevice: Device b016
Slot: 00:01.0
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Dummy Host Bridge
IOMMUGroup: 0
Slot: 00:01.2
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix GPP Bridge
IOMMUGroup: 1
Slot: 00:02.0
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Dummy Host Bridge
IOMMUGroup: 2
Slot: 00:02.1
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix GPP Bridge
IOMMUGroup: 3
Slot: 00:02.2
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix GPP Bridge
IOMMUGroup: 4
Slot: 00:02.3
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix GPP Bridge
IOMMUGroup: 5
Slot: 00:02.4
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix GPP Bridge
IOMMUGroup: 6
Slot: 00:03.0
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Dummy Host Bridge
IOMMUGroup: 7
Slot: 00:03.1
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Family 19h USB4/Thunderbolt PCIe tunnel
IOMMUGroup: 7
Slot: 00:04.0
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Dummy Host Bridge
IOMMUGroup: 8
Slot: 00:04.1
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Family 19h USB4/Thunderbolt PCIe tunnel
IOMMUGroup: 8
Slot: 00:08.0
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Dummy Host Bridge
IOMMUGroup: 9
Slot: 00:08.1
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Internal GPP Bridge to Bus [C:A]
IOMMUGroup: 10
Slot: 00:08.2
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Internal GPP Bridge to Bus [C:A]
IOMMUGroup: 11
Slot: 00:08.3
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Internal GPP Bridge to Bus [C:A]
IOMMUGroup: 12
Slot: 00:14.0
Class: SMBus
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: FCH SMBus Controller
SVendor: Unknown vendor 1f4c
SDevice: FCH SMBus Controller
Rev: 71
IOMMUGroup: 13
Slot: 00:14.3
Class: ISA bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: FCH LPC Bridge
SVendor: Unknown vendor 1f4c
SDevice: FCH LPC Bridge
Rev: 51
IOMMUGroup: 13
Slot: 00:18.0
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 0
IOMMUGroup: 14
Slot: 00:18.1
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 1
IOMMUGroup: 14
Slot: 00:18.2
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 2
IOMMUGroup: 14
Slot: 00:18.3
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 3
IOMMUGroup: 14
Slot: 00:18.4
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 4
IOMMUGroup: 14
Slot: 00:18.5
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 5
IOMMUGroup: 14
Slot: 00:18.6
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 6
IOMMUGroup: 14
Slot: 00:18.7
Class: Host bridge
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Data Fabric; Function 7
IOMMUGroup: 14
Slot: 01:00.0
Class: Non-Volatile memory controller
Vendor: MAXIO Technology (Hangzhou) Ltd.
Device: NVMe SSD Controller MAP1602 (DRAM-less)
SVendor: MAXIO Technology (Hangzhou) Ltd.
SDevice: NVMe SSD Controller MAP1602 (DRAM-less)
Rev: 01
ProgIf: 02
IOMMUGroup: 15
Slot: 02:00.0
Class: Ethernet controller
Vendor: Realtek Semiconductor Co., Ltd.
Device: RTL8125 2.5GbE Controller
SVendor: Unknown vendor 1f4c
SDevice: RTL8125 2.5GbE Controller
Rev: 05
IOMMUGroup: 16
Slot: 03:00.0
Class: Ethernet controller
Vendor: Realtek Semiconductor Co., Ltd.
Device: RTL8125 2.5GbE Controller
SVendor: Unknown vendor 1f4c
SDevice: RTL8125 2.5GbE Controller
Rev: 05
IOMMUGroup: 17
Slot: 04:00.0
Class: Network controller
Vendor: Intel Corporation
Device: Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak]
SVendor: Rivet Networks
SDevice: Killer Wi-Fi 6E AX1675x 160MHz
Rev: 1a
IOMMUGroup: 18
Slot: 05:00.0
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Navi 10 XL Upstream Port of PCI Express Switch
Rev: 10
IOMMUGroup: 19
Slot: 06:00.0
Class: PCI bridge
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Navi 10 XL Downstream Port of PCI Express Switch
Rev: 10
IOMMUGroup: 20
Slot: 07:00.0
Class: VGA compatible controller
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M]
SVendor: Advanced Micro Devices, Inc. [AMD/ATI]
SDevice: Device 1002
Rev: cc
IOMMUGroup: 21
Slot: 07:00.1
Class: Audio device
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Navi 31 HDMI/DP Audio
SVendor: Advanced Micro Devices, Inc. [AMD/ATI]
SDevice: Navi 31 HDMI/DP Audio
IOMMUGroup: 22
Slot: 07:00.2
Class: USB controller
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Navi 31 USB
SVendor: Advanced Micro Devices, Inc. [AMD/ATI]
SDevice: Navi 31 USB
ProgIf: 30
IOMMUGroup: 23
Slot: 07:00.3
Class: Serial bus controller
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Device 7444
SVendor: Advanced Micro Devices, Inc. [AMD/ATI]
SDevice: Device 0408
IOMMUGroup: 24
Slot: c8:00.0
Class: VGA compatible controller
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Phoenix1
SVendor: Unknown vendor 1f4c
SDevice: Device b016
Rev: c2
IOMMUGroup: 25
Slot: c8:00.1
Class: Audio device
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Device: Rembrandt Radeon High Definition Audio Controller
SVendor: Unknown vendor 1f4c
SDevice: Device b016
IOMMUGroup: 26
Slot: c8:00.2
Class: Encryption controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix CCP/PSP 3.0 Device
SVendor: Unknown vendor 1f4c
SDevice: Device b016
IOMMUGroup: 27
Slot: c8:00.3
Class: USB controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Device 15b9
SVendor: Unknown vendor 1f4c
SDevice: Device b016
ProgIf: 30
IOMMUGroup: 28
Slot: c8:00.4
Class: USB controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Device 15ba
SVendor: Unknown vendor 1f4c
SDevice: Device b016
ProgIf: 30
IOMMUGroup: 29
Slot: c8:00.5
Class: Multimedia controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: ACP/ACP3X/ACP6x Audio Coprocessor
SVendor: Unknown vendor 1f4c
SDevice: Raven/Raven2/FireFlight/Renoir Audio Processor
Rev: 63
IOMMUGroup: 30
Slot: c8:00.6
Class: Audio device
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Family 17h/19h/1ah HD Audio Controller
SVendor: Unknown vendor 1f4c
SDevice: Family 17h (Models 10h-1fh) HD Audio Controller
IOMMUGroup: 31
Slot: c9:00.0
Class: Non-Essential Instrumentation [1300]
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Dummy Function
SVendor: Unknown vendor 1f4c
SDevice: Device b016
IOMMUGroup: 32
Slot: c9:00.1
Class: Signal processing controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: AMD IPU Device
SVendor: Unknown vendor 1f4c
SDevice: Device b016
IOMMUGroup: 33
Slot: ca:00.0
Class: Non-Essential Instrumentation [1300]
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Phoenix Dummy Function
SVendor: Unknown vendor 1f4c
SDevice: Device b016
IOMMUGroup: 34
Slot: ca:00.3
Class: USB controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Device 15c0
SVendor: Unknown vendor 1f4c
SDevice: Device b016
ProgIf: 30
IOMMUGroup: 35
Slot: ca:00.4
Class: USB controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Device 15c1
SVendor: Unknown vendor 1f4c
SDevice: Device b016
ProgIf: 30
IOMMUGroup: 36
Slot: ca:00.5
Class: USB controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Pink Sardine USB4/Thunderbolt NHI controller #1
SVendor: Unknown vendor 1f4c
SDevice: Device b016
ProgIf: 40
IOMMUGroup: 37
Slot: ca:00.6
Class: USB controller
Vendor: Advanced Micro Devices, Inc. [AMD]
Device: Pink Sardine USB4/Thunderbolt NHI controller #2
SVendor: Unknown vendor 1f4c
SDevice: Device b016
ProgIf: 40
IOMMUGroup: 38
Updated by Brett Smith 26 days ago
22563-ansible-rocm @ f50087ae2e8e3f7906b63628a637d71c02d08e3b
- All agreed upon points are implemented / addressed.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
- The natural test would be to build a new tordo compute node with this playbook. That's not an option because our compute nodes are currently built on Debian 11, and AMD does not publish repositories for that distribution. Instead I have run build-compute-image.yml on a plain Debian 12 VM following the documented instructions, and attached the log of that run showing it working.
- Documentation has been updated.
- Added the new enabling flag with explanatory comment to source:tools/compute-images/host_config.example.yml
- Behaves appropriately at the intended scale (describe intended scale).
- N/A
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- N/A (no Ansible style guide)
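As a rough illustration of the enabling flag added above (only arvados_compute_amd_rocm_version is a name confirmed later in this ticket; the flag name and comment here are guesses):

# Hypothetical sketch of the additions to host_config.example.yml.
# Install the AMD GPU driver and ROCm stack on the compute image.
arvados_compute_amd_rocm: true
# Pin a specific ROCm release, or use the special value "latest".
arvados_compute_amd_rocm_version: "6.3.2"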
Updated by Peter Amstutz 26 days ago
Brett Smith wrote in #note-12:
22563-ansible-rocm @ f50087ae2e8e3f7906b63628a637d71c02d08e3b
- All agreed upon points are implemented / addressed.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
- The natural test would be to build a new tordo compute node with this playbook. That's not an option because our compute nodes are currently built on Debian 11, and AMD does not publish repositories for that distribution. Instead I have run build-compute-image.yml on a plain Debian 12 VM following the documented instructions, and attached the log of that run showing it working.
- Documentation has been updated.
- Added the new enabling flag with explanatory comment to source:tools/compute-images/host_config.example.yml
- Behaves appropriately at the intended scale (describe intended scale).
- N/A
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- N/A (no Ansible style guide)
So, the only thought I had was that we might want to support installing specific versions and not just the latest, but we can leave that for future work if we find we actually need it (it doesn't look like the CUDA playbook does that either).
LGTM.
Updated by Brett Smith 25 days ago
Peter Amstutz wrote in #note-14:
So, the only thought I had was that we might want to support installing specific versions and not just the latest, but we can leave that for future work if we find we actually need it (it doesn't look like the CUDA playbook does that either).
This is already supported. The implementation is a little odd, but in fairness to me, that's because AMD's repository layout is too.
- The arvados_compute_amd_rocm_version variable lets you set a string with the version number you want, or the special value "latest". This is 6.3.2 by default, since you said you were testing with 6.3.0 and we generally don't sweat minor version bumps until the software gives us a reason to.
- When we set up the amdgpu and rocm apt repositories, this variable goes into the URLs. I have never seen another apt repository set up this way, where the repository URL itself hardcodes a package version, but that's how AMD does it.
With the "pin" done in the apt URL, we don't need to pin anywhere else, so this is the most DRY approach. I have double-checked this works the way you would hope it does:
$ curl -fLO 'https://repo.radeon.com/rocm/apt/6.3.1/dists/jammy/main/binary-amd64/Packages.gz'
$ zcat Packages.gz | sed -n '/^Package: rocm$/,/^Version: / p'
Package: rocm
[…]
Version: 6.3.1.60301-48~22.04
This generic rocm package matches the version in the repository URL, not the actual latest (6.3.2). I'll add comments and merge, thanks.
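An equivalent in-playbook check could look like this (hypothetical tasks, not in the branch; apt-cache policy reports the candidate the pinned repository provides):

- name: Query the rocm candidate version
  ansible.builtin.command: apt-cache policy rocm
  register: rocm_policy
  changed_when: false

- name: Assert the candidate matches the repository pin
  ansible.builtin.assert:
    that: rocm_policy.stdout is search('Candidate: ' ~ arvados_compute_amd_rocm_version)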
Updated by Brett Smith 25 days ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|8f6fbce64046ebe408b72fa345675230adad0968.
Updated by Brett Smith 22 days ago
- Status changed from Resolved to In Progress
22563-rocm-disk-size @ 333d0fcd754a6062b0317484c8f8c30f53ea06f5
Compute node build: packer-build-compute-image: #314
Workflow run: tordo-xvhdp-qa6br9r7kqdr299
This branch fills in some gaps discovered when trying to deploy ROCm to the cloud. Then it marks the AMD ROCm support as in development, per discussion at standup.
- All agreed upon points are implemented / addressed.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- I guess the question is, what should that follow-up story be? I think it depends on whether we're going to get more AMD hardware or what the next steps of productizing this are going to be. That feels like a Peter question.
- Code is tested and passing, both automated and manual, what manual testing was done is described
- See the build above
- Documentation has been updated.
- Yes
- Behaves appropriately at the intended scale (describe intended scale).
- N/A
- Considered backwards and forwards compatibility issues between client and server.
- This has never been in a release, so marking it "in development" is fine
- Follows our coding standards and GUI style guidelines.
- N/A (no applicable style guide)
Updated by Peter Amstutz 19 days ago
Brett Smith wrote in #note-18:
22563-rocm-disk-size @ 333d0fcd754a6062b0317484c8f8c30f53ea06f5
Compute node build: packer-build-compute-image: #314
Workflow run: tordo-xvhdp-qa6br9r7kqdr299
This branch fills in some gaps discovered when trying to deploy ROCm to the cloud. Then it marks the AMD ROCm support as in development, per discussion at standup.
- All agreed upon points are implemented / addressed.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- I guess the question is, what should that follow-up story be? I think it depends on whether we're going to get more AMD hardware or what the next steps of productizing this are going to be. That feels like a Peter question.
- Code is tested and passing, both automated and manual, what manual testing was done is described
- See the build above
- Documentation has been updated.
- Yes
- Behaves appropriately at the intended scale (describe intended scale).
- N/A
- Considered backwards and forwards compatibility issues between client and server.
- This has never been in a release, so marking it "in development" is fine
- Follows our coding standards and GUI style guidelines.
- N/A (no applicable style guide)
LGTM
Updated by Peter Amstutz 19 days ago
Where I expect this will go: once we have hardware with stable PCI passthrough of an AMD GPU to KVM, we will have an environment where we can automate standing up a virtual machine, installing the drivers and ROCm, and then confirming that the installation was successful.
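For that confirmation step, a sketch of an automated post-install check (assumed tasks; rocminfo ships with the rocm packages under /opt/rocm/bin):

- name: Confirm the ROCm runtime can enumerate a GPU agent
  ansible.builtin.command: /opt/rocm/bin/rocminfo
  register: rocminfo_out
  changed_when: false

- name: Fail if no GPU agent was detected
  ansible.builtin.assert:
    that: rocminfo_out.stdout is search('Device Type:\s+GPU')
    fail_msg: ROCm is installed but no GPU agent is visible (check the amdgpu driver and PCI passthrough)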
Updated by Brett Smith 19 days ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|42833eec3935b12a1f31968c572e075a97194ad2.