Support #22562


Test running CUDA tordo with updated pins

Added by Peter Amstutz about 1 month ago. Updated 3 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Tests
Target version:
Due date:
Story points:
-
Release relationship:
Auto

Subtasks 1 (0 open, 1 closed)

Task #22634: Review run workflow to confirm it represents a successful test (Resolved, Peter Amstutz, 03/10/2025)

Related issues 3 (0 open, 3 closed)

Related to Arvados - Feature #22563: compute node ansible playbook to install ROCm (Resolved, Brett Smith, 02/17/2025)
Related to Arvados - Bug #22612: CUDA install doesn't really work because headers aren't available (Resolved, Brett Smith)
Related to Arvados - Support #22597: GPU test workflows for tordo (Resolved, Peter Amstutz)
Actions #1

Updated by Peter Amstutz about 1 month ago

  • Position changed from -937625 to -937622
Actions #2

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2025-02-26 to Development 2025-03-19
Actions #3

Updated by Peter Amstutz about 1 month ago

  • Related to Feature #22563: compute node ansible playbook to install ROCm added
Actions #4

Updated by Brett Smith 25 days ago

As noted in #22563#note-12, testing ROCm on tordo is complicated by the fact that AMD does not publish packages for Debian 11. One option is to build a compute node on Debian 12; it's goofy, but it should be straightforward, and compute nodes are self-contained enough that I can't think of anything that would break while the rest of the cluster was still on Debian 11. There might be other options.

Actions #5

Updated by Peter Amstutz 25 days ago

Brett Smith wrote in #note-4:

As noted in #22563#note-12, testing ROCm on tordo is complicated by the fact that AMD does not publish packages for Debian 11. One option is to build a compute node on Debian 12; it's goofy, but it should be straightforward, and compute nodes are self-contained enough that I can't think of anything that would break while the rest of the cluster was still on Debian 11. There might be other options.

I think that would be fine, I also can't think of anything that would tie the compute node's OS version to the OS version running on the rest of the cluster. Have we done any other testing with Debian 12 on compute nodes?

Actions #6

Updated by Brett Smith 25 days ago

Peter Amstutz wrote in #note-5:

Have we done any other testing with Debian 12 on compute nodes?

Not on an actually deployed node. The most I can say is that all the relevant tests pass on Debian 12, including integration tests with both Docker and Singularity.

Actions #7

Updated by Brett Smith 23 days ago

After a handful of small bugfixes, building a Debian 12 compute image for tordo with ROCm support is currently stuck on:

$ cat /var/lib/dkms/amdgpu/6.10.5-2109964.22.04/build/make.log
DKMS make.log for amdgpu-6.10.5-2109964.22.04 for kernel 6.1.0-31-cloud-amd64 (amd64)
Thu Feb 20 21:34:52 UTC 2025
make: Entering directory '/usr/src/linux-headers-6.1.0-31-cloud-amd64'
/tmp/amd.VycmSonU/Makefile:38: *** CONFIG_DRM disabled, exit....  Stop.
make: *** [/usr/src/linux-headers-6.1.0-31-common/Makefile:2034: /tmp/amd.VycmSonU] Error 2
make: Leaving directory '/usr/src/linux-headers-6.1.0-31-cloud-amd64'
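The failure above is predictable before DKMS ever runs: Debian's "cloud" kernel flavor omits the graphics stack, so CONFIG_DRM is unset and the amdgpu Makefile bails out. A minimal sketch of that check, using an illustrative sample config in place of the real `/boot/config-$(uname -r)` you would read on a live node:

```shell
# Sketch: predict the amdgpu DKMS failure by inspecting the kernel
# config. The sample below stands in for /boot/config-$(uname -r);
# on Debian's cloud kernel, CONFIG_DRM is not set.
config_sample='CONFIG_NET=y
# CONFIG_DRM is not set'

if printf '%s\n' "$config_sample" | grep -q '^CONFIG_DRM=[ym]'; then
  echo "CONFIG_DRM enabled: amdgpu DKMS build can proceed"
else
  echo "CONFIG_DRM disabled: amdgpu DKMS build will fail"
fi
```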
Actions #8

Updated by Brett Smith 22 days ago

Notes mostly for myself in the morning: all the bugfixes so far have been "obvious," they improve either all cloud deployments or the standard ROCm installation. They're fine.

We're past the point of obvious. My guess is, I could install the standard Linux image, reboot into that, and then build amdgpu against it. But would the end result actually be a usable cloud node? I don't know, and I should get some assurance on that point before I work on automating it.

How do you get a cloud node with an AMD GPU? Are there particular images recommended for it? How are they set up? Maybe the recipe for building a ROCm cloud node is "use a recommended base image, then add Arvados software to it," rather than trying to start from base Debian/Ubuntu.

Actions #9

Updated by Brett Smith 22 days ago

I do not believe it is possible to use ROCm in AWS.

The only instance type with an AMD GPU is the G4ad which can include up to four AMD Radeon Pro V520 GPUs.

This GPU is not listed on the table of supported GPUs in the ROCm documentation. The closest thing is the AMD Radeon Pro V620.

See also ROCm issue #1341 where someone reports basically this exact problem and the response is a vague "support is coming." Yes it's years old, but if you search the GitHub repository issues for aws you won't find anything more relevant, just a couple of dupes.

I'll keep looking to see if I can find a blog post or something where someone has documented how to do this, but I'm not optimistic. Right now I think our plan for 3.1.0 is to document that ROCm is not supported on the cloud, and that we'd welcome any contributions that help change that.

Actions #10

Updated by Peter Amstutz 22 days ago

DRM is Direct Rendering Manager, which is the graphics accelerator subsystem in Linux. It makes sense that a headless cloud VM image wouldn't typically have the graphics subsystem included in the kernel.

I found this page:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-amd-driver.html

It doesn't include ROCm in the list of supported APIs, so I think you are right that this isn't supported by AWS right now.

Actions #11

Updated by Brett Smith 22 days ago

Peter Amstutz wrote in #note-10:

DRM is Direct Rendering Manager

I know, thanks.

It doesn't include ROCm in the list of supported APIs, so I think you are right that this isn't supported by AWS right now.

Okay, so how are we going to test this actually produces a working installation? All of Curii's compute capacity is on AWS right now. Can we look to see if it makes sense to spin something up on another cloud? Can other developers get access to the test hardware we have?

Actions #12

Updated by Brett Smith 22 days ago

The ROCm AI Developer Hub lists "Featured Cloud Partners" with supported Instinct GPUs. It's Azure and a bunch of companies you probably don't want to deal with.

Actions #13

Updated by Brett Smith 22 days ago

Docker 28 was released two days ago. As long as the release stays on track for the first week of March, I think we should stick with pinning Docker 27, then update the pins and build a new Jenkins test node to run tests on Docker 28. The latest tordo compute node build shows Docker 28 can run a simple workflow, but we should also test arv-keepdocker etc. before planning to deploy it.
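One way to hold compute images on Docker 27 is an apt preferences pin. A sketch, assuming the `docker-ce`/`docker-ce-cli` package names from Docker's apt repository (the epoch/version pattern is illustrative, and the file is written to a temp path here rather than its real home in `/etc/apt/preferences.d/`):

```shell
# Sketch: pin Docker 27 so "apt upgrade" won't pull in Docker 28.
# Package names are from Docker's apt repo; version pattern is
# illustrative. Real deployments would write /etc/apt/preferences.d/.
pin_file="$(mktemp)"
cat > "$pin_file" <<'EOF'
Package: docker-ce docker-ce-cli
Pin: version 5:27.*
Pin-Priority: 1001
EOF
cat "$pin_file"
```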

Actions #14

Updated by Peter Amstutz 18 days ago

  • Related to Bug #22612: CUDA install doesn't really work because headers aren't available added
Actions #15

Updated by Peter Amstutz 18 days ago

Actions #16

Updated by Peter Amstutz 18 days ago

  • Status changed from New to In Progress
Actions #17

Updated by Peter Amstutz 18 days ago

  • Release set to 75
Actions #18

Updated by Brett Smith 16 days ago

#22612#note-6 might be our CUDA test. I did have to tweak the workflow to get it working, so it might be good if Peter could double-check that the test is valid, but I believe it is.
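The tweaked workflow itself isn't reproduced here, but for reference, a minimal GPU-requiring tool in CWL looks roughly like the sketch below. It uses the cwltool `CUDARequirement` extension, which arvados-cwl-runner understands; the tool (just `nvidia-smi`) and the version/capability values are illustrative, not the actual test workflow:

```shell
# Sketch: write a minimal CWL tool that declares a CUDA requirement
# (hypothetical stand-in for the real test workflow).
cwl_file="$(mktemp --suffix=.cwl)"
cat > "$cwl_file" <<'EOF'
cwlVersion: v1.2
class: CommandLineTool
$namespaces:
  cwltool: http://commonwl.org/cwltool#
requirements:
  cwltool:CUDARequirement:
    cudaVersionMin: "11.0"
    cudaComputeCapability: "5.0"
    cudaDeviceCountMin: 1
baseCommand: nvidia-smi
inputs: []
outputs: []
EOF
cat "$cwl_file"
```

Submitting it to a cluster with GPU instance types configured would be the usual `arvados-cwl-runner` invocation.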

Actions #19

Updated by Brett Smith 10 days ago

  • Assigned To set to Brett Smith
Actions #20

Updated by Brett Smith 10 days ago

  • Subtask #22634 added
Actions #21

Updated by Peter Amstutz 5 days ago

  • Subject changed from Test running CUDA and ROCm on tordo with updated pins to Test running CUDA tordo with updated pins
Actions #22

Updated by Peter Amstutz 5 days ago

  • Status changed from In Progress to Resolved