Feature #18325

closed

Option to include CUDA tooling in cloud compute image

Added by Peter Amstutz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

Node type

I used a "g4dn.xlarge" node for testing because, on brief inspection, it seemed to be the cheapest GPU node type available (something like $0.526/hr). It has a Tesla T4 GPU. However, you could probably have packer install all this stuff on a non-GPU node.

Kernel stuff

We need the linux-headers package that corresponds exactly to the kernel image, because dkms uses the headers to compile the nvidia kernel module on demand.

For Buster, the latest seem to be:

linux-image-4.19.0-18-cloud-amd64
linux-headers-4.19.0-18-cloud-amd64
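
Since the two package names above differ only in their prefix, the headers package can be derived from the image package so the versions cannot drift apart. This is a minimal sketch; `headers_pkg_for` is a hypothetical helper name, not something in the existing compute image scripts.

```shell
# Derive the matching linux-headers package name from a linux-image
# package name. headers_pkg_for is a hypothetical helper.
headers_pkg_for() {
    local image_pkg="$1"
    case "$image_pkg" in
        linux-image-*)
            # Swap the "linux-image-" prefix for "linux-headers-"
            printf 'linux-headers-%s\n' "${image_pkg#linux-image-}"
            ;;
        *)
            echo "not a linux-image package: $image_pkg" >&2
            return 1
            ;;
    esac
}

# Possible use in a build script (commented out because it needs root):
# IMAGE=linux-image-4.19.0-18-cloud-amd64
# apt-get -y install "$IMAGE" "$(headers_pkg_for "$IMAGE")"
```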

CUDA stuff

Note: starting with CUDA 11.5, the only Debian release NVIDIA supports is Bullseye. The previous version, 11.4.3, only supports Buster.

Installation commands from https://developer.nvidia.com/cuda-11-4-3-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=10&target_type=deb_network

apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/7fa2af80.pub
apt-get -y install software-properties-common
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/ /" 
add-apt-repository contrib
apt-get update
apt-get -y install cuda

After everything is installed, use "nvidia-detect" to make sure the GPU is detected and "nvidia-smi" to make sure the kernel module / driver is loaded.

If "nvidia-smi" doesn't work, it probably means the kernel module didn't build. Try "dkms autoinstall" and see what failed.
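
The check-then-diagnose sequence described here could be wrapped in a small function for the image build. This is only a sketch: `check_gpu_driver` and the `NVIDIA_SMI` / `DKMS` override variables are hypothetical names, added so the commands can be swapped out when testing on a machine without a GPU.

```shell
# Verify the nvidia driver is loaded; if not, run dkms so the build
# errors show up in the log. check_gpu_driver, NVIDIA_SMI and DKMS are
# hypothetical names, not part of any existing tooling.
check_gpu_driver() {
    local smi="${NVIDIA_SMI:-nvidia-smi}"
    local dkms_cmd="${DKMS:-dkms}"

    if "$smi" >/dev/null 2>&1; then
        echo "nvidia driver OK"
        return 0
    fi

    echo "nvidia-smi failed; running dkms autoinstall to see what failed" >&2
    "$dkms_cmd" autoinstall >&2
    # Still report failure so the caller stops and inspects the output
    return 1
}
```

Calling `check_gpu_driver || exit 1` at the end of the install step would make the image build fail early if the module did not build.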

Docker stuff

We need to have Docker 19.03 or later installed -- the current compute image is using the "docker.io" package shipped with Buster, which is 18.xx. The latest version in the docker-ce 19.03.xx series is 19.03.15. We could also upgrade to a more recent version.

curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
mkdir -p /etc/apt/sources.list.d && \
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian/ buster stable" > /etc/apt/sources.list.d/docker.list && \
    apt-get update && \
    apt-get -yq --no-install-recommends install docker-ce=5:19.03.15~3-0~debian-buster && \
    apt-get clean
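
To guard against the image ending up with an older Docker again, the build could compare the installed version against the 19.03 minimum. A rough sketch, assuming GNU `sort -V` is available (it is in Debian's coreutils); `version_at_least` is a hypothetical helper name.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# version_at_least is a hypothetical helper; it relies on GNU sort -V.
version_at_least() {
    local have="$1" want="$2"
    # With version sort, the smaller of the two comes first; the check
    # passes when that smaller one is the required minimum.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Possible use once docker is installed and running (commented out here):
# version_at_least "$(docker version --format '{{.Server.Version}}')" 19.03 \
#     || { echo "Docker is older than 19.03" >&2; exit 1; }
```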

nvidia-container-toolkit

This is some additional tooling used by both Singularity and Docker to support CUDA.

DIST=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get -y install libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit

You might also need to restart Docker after this is installed:

systemctl restart docker
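
For reference, the DIST value used in the repository URL above is just the ID and VERSION_ID fields of /etc/os-release joined together, which on Buster gives "debian10". The sketch below shows the same computation as a function that takes the file path as an argument, so it can be tried against a sample file; `dist_id` is a hypothetical name.

```shell
# Print $ID$VERSION_ID from an os-release file (default /etc/os-release).
# dist_id is a hypothetical helper mirroring the DIST=... line above.
dist_id() {
    local os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so ID and VERSION_ID do not leak
    # into the calling shell.
    ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Try it against a sample file with Buster's values:
sample=$(mktemp)
printf 'ID=debian\nVERSION_ID="10"\n' > "$sample"
dist_id "$sample"
rm -f "$sample"
```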

Testing that GPU is available inside the container

docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi
singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi

Subtasks 1 (0 open, 1 closed)

Task #18595: review 18325-cuda-azure-image (Resolved, Peter Amstutz, 12/20/2021)

Related issues

Related to Arvados Epics - Idea #15957: GPU support (Resolved, 10/01/2021 - 03/31/2022)
Related to Arvados - Support #18606: GPU support on tordo cluster (Resolved, Ward Vandewege)