Feature #18325

Updated by Peter Amstutz over 2 years ago

h1. Node type 

 I used a "g4dn.xlarge" node for testing because, on brief inspection, it seemed to be the cheapest GPU node type available (something like $0.526/hr). It has a Tesla T4 GPU. However, you could probably have packer install all of this on a non-GPU node.

 h1. Kernel stuff 

 The linux-headers package must correspond exactly to the installed kernel image, because @dkms@ uses the headers to compile the nvidia kernel module on demand.

 For Buster the latest seem to be: 

 linux-image-4.19.0-18-cloud-amd64 
 linux-headers-4.19.0-18-cloud-amd64 
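
 The image/headers pairing above can be checked with a small sketch like the following. The helper names @headers_pkg_for@ and @running_kernel_has_headers@ are made up for illustration, and the check assumes a Debian system where @dpkg-query@ is available.

```shell
# Hypothetical helper: build the headers package name for a kernel release.
# $1 is a kernel release string, as printed by `uname -r`.
headers_pkg_for() {
    printf 'linux-headers-%s\n' "$1"
}

# Hypothetical helper: succeed only if the headers package matching the
# running kernel is fully installed according to dpkg.
running_kernel_has_headers() {
    pkg="$(headers_pkg_for "$(uname -r)")"
    dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q '^install ok installed$'
}
```

 Running @running_kernel_has_headers || echo "headers missing"@ before installing CUDA should catch a mismatch early, before @dkms@ fails to build the module.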

 h2. CUDA stuff 

 Note: starting with CUDA 11.5, NVIDIA only supports Debian Bullseye. The previous version, 11.4.3, only supports Buster.

 Installation commands from https://developer.nvidia.com/cuda-11-4-3-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=10&target_type=deb_network 

 <pre> 
 apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/7fa2af80.pub 
 apt-get install software-properties-common 
 add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/ /" 
 add-apt-repository contrib 
 apt-get update 
 apt-get -y install cuda 
 </pre> 

 After everything is installed, use "nvidia-detect" to make sure the GPU is detected and "nvidia-smi" to make sure the kernel module / driver is loaded. 

 If "nvidia-smi" doesn't work, it probably means the kernel module didn't build. Try "dkms autoinstall" and see what failed.
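
 As a rough sketch of that check, something like the following could go into an image build or boot script. @check_gpu_driver@ is a hypothetical helper, and the optional argument (the probe command, defaulting to @nvidia-smi@) is only there so the logic can be exercised on a machine without a GPU.

```shell
# Hypothetical helper: run the probe command (default: nvidia-smi) and
# report whether the driver responded. On failure, print a hint pointing
# at dkms and return non-zero.
check_gpu_driver() {
    probe="${1:-nvidia-smi}"
    if "$probe" >/dev/null 2>&1; then
        echo "driver OK"
    else
        echo "driver not responding; try: dkms status, then dkms autoinstall"
        return 1
    fi
}
```

 The helper only reports the failure. Rebuilding the module with @dkms autoinstall@ is left as a manual step so that the build errors are visible.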

 h2. Docker stuff 

 We need to have Docker 19.03 or later installed -- the current compute image is using the "docker.io" package shipped with Buster, which is 18.xx.    The latest version in the docker-ce 19.03.xx series is 19.03.15.    We could also upgrade to a more recent version. 

 <pre> 
 curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
 mkdir -p /etc/apt/sources.list.d && \
     echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian/ buster stable" > /etc/apt/sources.list.d/docker.list && \
     apt-get update && \
     apt-get -yq --no-install-recommends install docker-ce=5:19.03.15~3-0~debian-buster && \
     apt-get clean
 </pre> 
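
 To confirm that the installed Docker meets the 19.03 minimum, a version comparison along these lines might help. @version_ge@ is a hypothetical helper, and it assumes GNU @sort@ with the @-V@ (version sort) option, which Debian provides.

```shell
# Hypothetical helper: version_ge A B succeeds if version A >= version B.
# Relies on GNU sort -V: the smaller of the two versions sorts first, so
# A >= B exactly when B is the first line of the sorted pair.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use against a running daemon:
#   version_ge "$(docker version --format '{{.Server.Version}}')" 19.03 \
#       || echo "Docker is too old for --gpus"
```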

 h2. nvidia-container-toolkit 

 This is some additional tooling used by both Singularity and Docker to support CUDA.

 <pre> 
 DIST=$(. /etc/os-release; echo $ID$VERSION_ID) 
 curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | \ 
   sudo apt-key add - 
 curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.list | \ 
   sudo tee /etc/apt/sources.list.d/libnvidia-container.list 
 sudo apt-get update 
 sudo apt-get install libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit
 </pre> 

 You might also need to restart Docker after this is installed:

 <pre> 
 systemctl restart docker 
 </pre> 
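
 For reference, the @DIST@ variable in the toolkit install commands is just the @ID@ and @VERSION_ID@ fields of @/etc/os-release@ joined together, which on Buster should come out as @debian10@. The sketch below does the same thing against an arbitrary file. @dist_from_os_release@ is a hypothetical helper, and it expects a path containing a slash (such as an absolute path) so that @.@ does not search @PATH@.

```shell
# Hypothetical helper: print $ID$VERSION_ID from an os-release style file.
# The file is sourced in a subshell so its variables do not leak into the
# calling shell.
dist_from_os_release() {
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

 Calling @dist_from_os_release /etc/os-release@ before adding the repository shows which list file will be fetched.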

 h2. Testing that GPU is available inside the container  

 <pre> 
 docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi 
 </pre> 

 <pre> 
 singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi 
 </pre> 
