Bug #22792
closed
NVIDIA CUDA modules dance doesn't work with latest versions
Description
As seen in packer-build-compute-image: #328
12:09:26 amazon-ebs: TASK [ansible.builtin.include_role : compute_nvidia] ***************************
12:09:26 amazon-ebs:
12:09:26 amazon-ebs: TASK [compute_nvidia : Install NVIDIA package pins] ****************************
12:09:26 amazon-ebs: skipping: [default]
12:09:26 amazon-ebs:
12:09:26 amazon-ebs: TASK [compute_nvidia : Install NVIDIA CUDA apt repository] *********************
12:09:28 amazon-ebs: changed: [default]
12:09:28 amazon-ebs:
12:09:28 amazon-ebs: TASK [compute_nvidia : Install NVIDIA container toolkit apt repository] ********
12:09:29 amazon-ebs: changed: [default]
12:09:29 amazon-ebs:
12:09:29 amazon-ebs: TASK [compute_nvidia : Install NVIDIA CUDA build prerequisites] ****************
12:09:34 amazon-ebs: changed: [default]
12:09:34 amazon-ebs:
12:09:34 amazon-ebs: TASK [compute_nvidia : Install NVIDIA packages] ********************************
12:18:31 amazon-ebs: changed: [default]
12:18:31 amazon-ebs:
12:18:31 amazon-ebs: TASK [compute_nvidia : Copy nvidia.conf modules to nvidia.avail] ***************
12:18:31 amazon-ebs: fatal: [default]: FAILED! => {"changed": false, "msg": "Source /etc/modules-load.d/nvidia.conf not found"}
We do this as part of the dance to rig up a cloud image so we only load NVIDIA modules if we actually boot with an NVIDIA GPU. This dance works with CUDA 560 but apparently things moved around in CUDA 570. We will have to update our playbook to adapt.
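For readers unfamiliar with the idea, the "only load modules when a GPU is present" check could be sketched roughly as follows. This is not the playbook's actual code: the function name `has_nvidia_gpu` and the sysfs-scanning approach are assumptions made for illustration. It relies only on the fact that NVIDIA's PCI vendor ID is 0x10de.

```shell
#!/bin/sh
# Hypothetical helper (not from the playbook): succeed if any PCI device
# under the given sysfs directory reports NVIDIA's vendor ID (0x10de).
# $1 is the PCI devices directory; it defaults to the real sysfs path.
# Note this is coarse: it matches any NVIDIA PCI device, not only GPUs.
has_nvidia_gpu() {
    devices_dir="${1:-/sys/bus/pci/devices}"
    for vendor_file in "$devices_dir"/*/vendor; do
        # Skip the unexpanded glob when the directory has no devices.
        [ -r "$vendor_file" ] || continue
        if [ "$(cat "$vendor_file")" = "0x10de" ]; then
            return 0
        fi
    done
    return 1
}
```

A boot-time script could then gate module loading on it, for example `has_nvidia_gpu && modprobe nvidia`.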
Updated by Brett Smith 3 days ago
So, this is mostly good news. Version 570 of the CUDA drivers switches from this static modules-load file to a udev rule. On a fresh install:
$ apt list --installed 'nvidia*' | cut -d/ -f1 | xargs dpkg -L | grep -e '^/etc/' -e /udev/
/etc/modprobe.d
/etc/modprobe.d/nvidia-modeset.conf
/etc/modprobe.d/nvidia.conf
/usr/lib/udev/rules.d
/usr/lib/udev/rules.d/60-nvidia.rules
/etc/OpenCL
/etc/OpenCL/vendors
/etc/OpenCL/vendors/nvidia.icd
/etc/X11
This means that out of the box, it has the dynamic behavior we want. We can upgrade our pin to version 570, rip out all our code to deal with this dynamism, and still have all the functionality we need. Great!
Except NVIDIA has stopped updating the driver for Debian 11. We need to continue supporting the layout of version 560 as long as we want to continue supporting Debian 11.
In principle we have agreed to drop Debian 11 for the next release. In practice, with all of our own clusters upgraded to Debian 12, I am not aware of any users currently on Debian 11. So, I think it would be okay to go ahead with the upgrade plan described above. But this would pretty officially commit us to dropping Debian 11, so I want to make sure we're on the same page about that before I go ahead.
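If we ever needed to tell the two layouts apart on a given image, a check along these lines might do it. This is only a sketch: the function name `nvidia_module_layout` and the labels it prints are made up for illustration, and it assumes the two paths seen in this ticket are reliable markers (the static file from the failing task, the udev rule from the 570 package listing). It checks the static file first because this ticket does not establish whether 560 also ships a udev rules file.

```shell
#!/bin/sh
# Hypothetical helper: report which module-loading layout is installed.
# $1 is an optional filesystem root prefix (empty means the live system).
# Prints "static" (560-style), "udev" (570-style), or "none".
nvidia_module_layout() {
    root="${1:-}"
    if [ -e "$root/etc/modules-load.d/nvidia.conf" ]; then
        # Static modules-load file: the layout our current dance expects.
        echo static
    elif [ -e "$root/usr/lib/udev/rules.d/60-nvidia.rules" ]; then
        # udev rule only: modules load dynamically out of the box.
        echo udev
    else
        echo none
    fi
}
```

Passing a root prefix makes it possible to run the check against a mounted image or a test directory rather than the running system.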
Updated by Peter Amstutz 3 days ago
- Target version set to Development 2025-04-16
Updated by Peter Amstutz 3 days ago
- Target version changed from Development 2025-04-16 to Development 2025-04-30
Updated by Brett Smith 1 day ago
22792-cuda-570 @ be1277c6a8b468c25d91f07681d320569d3578ba - packer-build-compute-image: #329
- All agreed upon points are implemented / addressed.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
- Tested the new image at tordo-xvhdp-3o7bnol1rzguqj6. You can tell this used the new image because it's using crunch-run 3.2.0~dev20250416134717 (note the very recent dev timestamp), and you can tell it successfully used CUDA in the subprocess logs (if it hadn't, it wouldn't have run so fast).
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- No change
- Considered backwards and forwards compatibility issues between client and server.
- Peter agreed at kickoff that it's okay to take this approach and begin abandoning Debian 11.
- Follows our coding standards and GUI style guidelines.
- N/A (no Ansible guide)
Updated by Lucas Di Pentima 1 day ago
As Peter used to say, "the best patch is the red patch"!
LGTM, thanks.
Updated by Brett Smith 1 day ago
- Status changed from New to Resolved
Applied in changeset arvados-private:commit:arvados|828436783a47409a5ef76e084b7c47bc0a0695e2.