Bug #22792

closed

NVIDIA CUDA modules dance doesn't work with latest versions

Added by Brett Smith 4 days ago. Updated about 12 hours ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Target version:
Story points: -

Description

As seen in packer-build-compute-image: #328

12:09:26     amazon-ebs: TASK [ansible.builtin.include_role : compute_nvidia] ***************************
12:09:26     amazon-ebs:
12:09:26     amazon-ebs: TASK [compute_nvidia : Install NVIDIA package pins] ****************************
12:09:26     amazon-ebs: skipping: [default]
12:09:26     amazon-ebs:
12:09:26     amazon-ebs: TASK [compute_nvidia : Install NVIDIA CUDA apt repository] *********************
12:09:28     amazon-ebs: changed: [default]
12:09:28     amazon-ebs:
12:09:28     amazon-ebs: TASK [compute_nvidia : Install NVIDIA container toolkit apt repository] ********
12:09:29     amazon-ebs: changed: [default]
12:09:29     amazon-ebs:
12:09:29     amazon-ebs: TASK [compute_nvidia : Install NVIDIA CUDA build prerequisites] ****************
12:09:34     amazon-ebs: changed: [default]
12:09:34     amazon-ebs:
12:09:34     amazon-ebs: TASK [compute_nvidia : Install NVIDIA packages] ********************************
12:18:31     amazon-ebs: changed: [default]
12:18:31     amazon-ebs:
12:18:31     amazon-ebs: TASK [compute_nvidia : Copy nvidia.conf modules to nvidia.avail] ***************
12:18:31     amazon-ebs: fatal: [default]: FAILED! => {"changed": false, "msg": "Source /etc/modules-load.d/nvidia.conf not found"}

We do this as part of the dance to rig up a cloud image so we only load NVIDIA modules if we actually boot with an NVIDIA GPU. This dance works with CUDA 560 but apparently things moved around in CUDA 570. We will have to update our playbook to adapt.
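
The boot-time decision described above could be sketched roughly as follows. This is only an illustration, not the playbook's actual logic: the function name `has_nvidia_pci_device` is hypothetical, and it assumes that checking sysfs for NVIDIA's PCI vendor ID (0x10de) is an acceptable stand-in for "booted with an NVIDIA GPU". Note that it matches any NVIDIA PCI device, not only GPUs.

```shell
# Hypothetical helper (not from the Arvados playbook): succeed if any PCI
# device under the given sysfs root reports NVIDIA's vendor ID, 0x10de.
# The sysfs root is a parameter (default /sys) so the check can be tested
# against a fake directory tree.
has_nvidia_pci_device() {
    nv_sysfs_root="${1:-/sys}"
    for nv_vendor_file in "$nv_sysfs_root"/bus/pci/devices/*/vendor; do
        # If the glob matched nothing, the pattern stays literal and is skipped.
        [ -r "$nv_vendor_file" ] || continue
        if [ "$(cat "$nv_vendor_file")" = "0x10de" ]; then
            return 0
        fi
    done
    return 1
}
```

A boot-time hook could call `has_nvidia_pci_device` and only enable the module list when it succeeds. How the real role wires this up is not shown in this ticket.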


Subtasks 1 (0 open, 1 closed)

Task #22797: Review 22792-cuda-570 (Resolved, Lucas Di Pentima, 04/17/2025)
Actions #1

Updated by Brett Smith 3 days ago

So, this is mostly good news. Version 570 of the CUDA drivers switches from this static modules-load file to a udev rule. On a fresh install:

$ apt list --installed 'nvidia*' | cut -d/ -f1 | xargs dpkg -L | grep -e '^/etc/' -e /udev/
/etc/modprobe.d
/etc/modprobe.d/nvidia-modeset.conf
/etc/modprobe.d/nvidia.conf
/usr/lib/udev/rules.d
/usr/lib/udev/rules.d/60-nvidia.rules
/etc/OpenCL
/etc/OpenCL/vendors
/etc/OpenCL/vendors/nvidia.icd
/etc/X11

This means that out of the box, it has the dynamic behavior we want. We can upgrade our pin to version 570, rip out all our code to deal with this dynamism, and still have all the functionality we need. Great!
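
To tell the two layouts apart on a given image, a check along these lines might help. It is a sketch under assumptions: the function name `nvidia_module_layout` and its output labels are made up for illustration, and the only paths it looks at are the two that appear in this ticket (the static file from the failed task and the udev rule from the listing above).

```shell
# Hypothetical helper: report which module-loading layout is present under a
# root directory (default: the live filesystem).
#   static - /etc/modules-load.d/nvidia.conf exists (layout seen with 560)
#   udev   - /usr/lib/udev/rules.d/60-nvidia.rules exists (layout seen with 570)
#   none   - neither file was found
nvidia_module_layout() {
    nv_root="${1:-}"
    if [ -e "$nv_root/etc/modules-load.d/nvidia.conf" ]; then
        echo static
    elif [ -e "$nv_root/usr/lib/udev/rules.d/60-nvidia.rules" ]; then
        echo udev
    else
        echo none
    fi
}
```

Running `nvidia_module_layout` with no argument inspects the running system; passing a directory inspects an unpacked or mounted image instead.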

Except NVIDIA has stopped updating the driver for Debian 11. We need to continue supporting the layout of version 560 as long as we want to continue supporting Debian 11.

In principle we have agreed to drop Debian 11 for the next release. In practice, with all of our own clusters upgraded to Debian 12, I am not aware of any users currently on Debian 11. So, I think it would be okay to go ahead with the upgrade plan described above. But this would pretty officially commit us to dropping Debian 11, so I want to make sure we're on the same page about that before I go ahead.

Actions #2

Updated by Peter Amstutz 3 days ago

  • Target version set to Development 2025-04-16
Actions #3

Updated by Peter Amstutz 3 days ago

  • Target version changed from Development 2025-04-16 to Development 2025-04-30
Actions #4

Updated by Peter Amstutz 3 days ago

  • Assigned To set to Brett Smith
Actions #5

Updated by Peter Amstutz 3 days ago

  • Subtask #22797 added
Actions #6

Updated by Brett Smith 1 day ago

22792-cuda-570 @ be1277c6a8b468c25d91f07681d320569d3578ba - packer-build-compute-image: #329

  • All agreed upon points are implemented / addressed.
    • Yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual; what manual testing was done is described.
    • Tested the new image at tordo-xvhdp-3o7bnol1rzguqj6. You can tell this used the new image because it's running crunch-run 3.2.0~dev20250416134717 (note the very recent dev timestamp), and you can tell it successfully used CUDA from the subprocess logs (if it hadn't, it wouldn't have run so fast).
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • No change
  • Considered backwards and forwards compatibility issues between client and server.
    • Peter agreed at kickoff that it's okay to take this approach and begin abandoning Debian 11.
  • Follows our coding standards and GUI style guidelines.
    • N/A (no Ansible guide)
Actions #7

Updated by Lucas Di Pentima 1 day ago

As Peter used to say, "the best patch is the red patch"!

LGTM, thanks.

Actions #8

Updated by Brett Smith 1 day ago

  • Status changed from New to Resolved

Applied in changeset arvados-private:commit:arvados|828436783a47409a5ef76e084b7c47bc0a0695e2.
