Feature #18325
Updated by Peter Amstutz about 3 years ago
h1. Node type
I used a "g4dn.xlarge" node for testing because, on brief inspection, it seemed to be the cheapest GPU node available (something like $0.526/hr). It has a Tesla T4 GPU. However, you could probably have packer install all of this on a non-GPU node.
h1. Kernel stuff
The linux-headers package must correspond exactly to the installed kernel image, because the NVIDIA driver install uses @dkms@ to compile the kernel module on demand.
For Buster, the latest seem to be:
linux-image-4.19.0-18-cloud-amd64
linux-headers-4.19.0-18-cloud-amd64
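Since an exact match matters, it can help to check the running kernel before installing CUDA. The following is a minimal sketch, assuming a Debian system where @dpkg-query@ is available; the function names @headers_pkg_for@ and @check_running_kernel_headers@ are made up for this example.

```shell
# Print the headers package name that matches a given kernel release string.
headers_pkg_for() {
    printf 'linux-headers-%s\n' "$1"
}

# Report whether the headers package for the running kernel is installed.
# Assumption: dpkg-query exists (Debian or a derivative).
check_running_kernel_headers() {
    pkg="$(headers_pkg_for "$(uname -r)")"
    if dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q 'install ok installed'; then
        echo "ok: $pkg is installed"
    else
        echo "missing: $pkg (try: apt-get install $pkg)"
        return 1
    fi
}
```

Calling @check_running_kernel_headers@ on the build node before installing CUDA should flag a mismatch early, rather than leaving it to surface later as a failed dkms build.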
h2. CUDA stuff
Note: starting with CUDA 11.5, NVIDIA's Debian packages only support Bullseye. The previous version, 11.4.3, only supports Buster.
Installation commands from https://developer.nvidia.com/cuda-11-4-3-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=10&target_type=deb_network
<pre>
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/7fa2af80.pub
apt-get install software-properties-common
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/ /"
add-apt-repository contrib
apt-get update
apt-get -y install cuda
</pre>
After everything is installed, use "nvidia-detect" to make sure the GPU is detected and "nvidia-smi" to make sure the kernel module / driver is loaded.
If "nvidia-smi" doesn't work, it probably means the kernel module didn't build. Run "dkms autoinstall" and see what failed.
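To narrow down a failed build, the output of "dkms status" can be checked for the running kernel. This is only a sketch: it assumes each status line names the module, the kernel release, and ends in @: installed@ when the build succeeded, and the exact layout varies between dkms versions. @nvidia_dkms_state@ is a hypothetical helper name.

```shell
# Read `dkms status` output on stdin and print "installed" if an nvidia
# module is reported as installed for the kernel release given as $1,
# otherwise print "missing".
# Assumption: status lines contain the module name, the kernel release,
# and the suffix ": installed" on success.
nvidia_dkms_state() {
    awk -v k="$1" '
        tolower($0) ~ /nvidia/ && index($0, k) && /: installed/ { found = 1 }
        END { print (found ? "installed" : "missing") }
    '
}
```

For example, @dkms status | nvidia_dkms_state "$(uname -r)"@ would print "missing" when the module was only added or built for a different kernel.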
h2. Docker stuff
We need to have Docker 19.03 or later installed -- the current compute image is using the "docker.io" package shipped with Buster, which is 18.xx. The latest version in the docker-ce 19.03.xx series is 19.03.15. We could also upgrade to a more recent version.
<pre>
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
mkdir -p /etc/apt/sources.list.d && \
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian/ buster stable" > /etc/apt/sources.list.d/docker.list && \
apt-get update && \
apt-get -yq --no-install-recommends install docker-ce=5:19.03.15~3-0~debian-buster && \
apt-get clean
</pre>
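The 19.03 minimum matters because that is the release where @docker run@ gained the @--gpus@ option. A quick way to confirm the installed version meets it is sketched below, assuming GNU @sort -V@ is available (it is on Debian); @check_docker_version@ is an illustrative name, not part of any existing tooling.

```shell
# Print "ok" if the version string in $1 is at least 19.03, else "too old".
# Assumption: GNU sort with -V (version sort) is available.
check_docker_version() {
    min="19.03"
    lowest="$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n 1)"
    if [ "$lowest" = "$min" ]; then
        echo "ok"
    else
        echo "too old"
    fi
}
```

On a node with Docker running, this could be called as @check_docker_version "$(docker version --format '{{.Server.Version}}')"@.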
h2. nvidia-container-toolkit
This is additional tooling used by both Singularity and Docker to support CUDA.
<pre>
DIST=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit
</pre>
You might also need to restart Docker after this is installed:
<pre>
systemctl restart docker
</pre>
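The @DIST@ assignment in the install commands sources @/etc/os-release@ and concatenates @ID@ and @VERSION_ID@, which selects the repository list to download. A small sketch of the same idea, with the file path as a parameter so it can be tried against a copy of the file; @dist_from_os_release@ is a hypothetical name.

```shell
# Print ID and VERSION_ID from an os-release style file, concatenated.
# Runs in a subshell so the sourced variables do not leak into the caller.
# Assumption: $1 is a path containing a slash (so "." does not search PATH).
dist_from_os_release() {
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

On Buster, @dist_from_os_release /etc/os-release@ should print @debian10@, so the list is fetched from the @debian10@ directory of the libnvidia-container repository.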
h2. Testing that GPU is available inside the container
<pre>
docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi
</pre>
<pre>
singularity exec --nv docker://nvidia/cuda:11.0-base nvidia-smi
</pre>
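For an automated check (for example in a packer provisioning step), the human-readable "nvidia-smi" table is awkward to parse. One option is @nvidia-smi -L@, which lists one GPU per line. The sketch below counts those lines; it assumes each line starts with @GPU <index>:@, and @count_gpus@ is a made-up helper name.

```shell
# Count GPUs in `nvidia-smi -L` output read from stdin.
# Assumption: each GPU is listed on its own line starting with "GPU <n>:".
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function from failing while still printing "0".
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}
```

For example, @docker run --rm --gpus 1 nvidia/cuda:11.0-base nvidia-smi -L | count_gpus@ would be expected to print @1@ when the container can see the GPU.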