Idea #17240

closed

Scoping/grooming GPU support work

Added by Peter Amstutz over 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version: -
Start date: 04/07/2021
Due date:
Story points: -

Description

https://docs.nvidia.com/deploy/cuda-compatibility/index.html

NVIDIA says:

The CUDA software environment consists of three parts:

  • CUDA Toolkit (libraries, CUDA runtime and developer tools) - User-mode SDK used to build CUDA applications
  • CUDA driver - User-mode driver component used to run CUDA applications (such as libcuda.so on Linux systems)
  • NVIDIA GPU device driver - Kernel-mode driver component for NVIDIA GPUs

On Linux systems, the CUDA driver and kernel mode components are delivered together in the NVIDIA display driver package. This is shown in Figure 1.

...

1.3. Binary Compatibility

We define binary compatibility as a set of guarantees provided by the library, where an application targeting the said library will continue to work when dynamically linked against a different version of the library.

The CUDA Driver API has a versioned C-style ABI, which guarantees that applications that were running against an older driver (for example CUDA 3.2) will still run and function correctly against a modern driver (for example one shipped with CUDA 11.0). This is a stronger contract than an API guarantee - an application might need to change its source when recompiling against a newer SDK, but replacing the driver with a newer version will always work.

The CUDA Driver API thus is binary-compatible (the OS loader can pick up a newer version and the application continues to work) but not source-compatible (rebuilding your application against a newer SDK might require source changes). In addition, the binary-compatibility is in one direction: backwards.

...

Each version of the CUDA Toolkit (and runtime) requires a minimum version of the NVIDIA driver. The CUDA driver (libcuda.so on Linux for example) included in the NVIDIA driver package, provides binary backward compatibility. For example, an application built against the CUDA 3.2 SDK will continue to function even on today’s driver stack. On the other hand, the CUDA runtime has not provided either source or binary compatibility guarantees. Newer major and minor versions of the CUDA runtime have frequently changed the exported symbols, including their version or even their availability, and the dynamic form of the library has its shared object name (.SONAME in Linux-based systems) change every minor version.

Notes

Inside the container: the image must include the correct CUDA runtime (if the application is dynamically linked against it), or the application must be statically linked.

  • the runtime requires a minimum version of the driver; this should be declared as a requirement
  • nvidia-smi reports some of this, such as the driver version and the CUDA version it supports
  • cubins (programs compiled directly for a GPU) target a specific "compute" capability and only run on hardware with the same major revision and an equal or later minor revision.
  • required libraries: libcuda.so.* - the CUDA Driver
  • required libraries: libnvidia-ptxjitcompiler.so.*

NVIDIA apparently also offers "driver containers", which install the kernel driver (???) and a persistent daemon on the fly instead of relying on drivers being installed on the host.

https://docs.nvidia.com/datacenter/cloud-native/driver-containers/overview.html

  • Singularity supports bind mounting the NVIDIA driver. It apparently requires nvidia-container-cli:
    https://github.com/NVIDIA/libnvidia-container
    It seems this tool can also be used to query the system for the libraries that need to be bind mounted.

Each node needs to declare, for each device:
  • driver version
  • hardware capability

A container request needs to specify:
  • minimum driver version → select nodes with >= minimum driver version
  • cubin hardware capability (or none) → select nodes with the same major revision and >= minor revision of hardware capability
    • a program can be compiled for multiple targets, so this should be a list
  • PTX hardware capability → select nodes with >= PTX hardware capability
It seems each SDK release has a new driver, so often the SDK version is printed instead of the underlying driver version. There's a table of which driver corresponds to which SDK revision.

With version 11 of the SDK it seems that the userspace part of the driver can be upgraded without upgrading the kernel driver.

What docker --gpus does

https://github.com/docker/cli/blob/88c6089300a82d3373892adf6845a4fed1a4ba8d/docs/reference/commandline/run.md

https://docs.docker.com/config/containers/resource_constraints/

https://github.com/docker/cli/blob/88c6089300a82d3373892adf6845a4fed1a4ba8d/opts/gpus.go

https://github.com/moby/moby/blob/46cdcd206c56172b95ba5c77b827a722dab426c5/daemon/nvidia_linux.go

Using the API, the simplest valid request is DeviceRequest with:
  • DeviceRequest.Count = 1
  • DeviceRequest.Capabilities = [ ["gpu"] ]
A better one is probably:
  • DeviceRequest.Driver = "nvidia"
  • DeviceRequest.Count = -1 (request all GPUs)
  • DeviceRequest.Capabilities = [ ["gpu", "nvidia", "compute"] ]
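
As a rough sketch, the two requests above could be expressed as the JSON body of a container-create call, where device requests live under HostConfig.DeviceRequests. The helper names and the image name "example/cuda-app" are hypothetical, and the snippet only builds the body; it does not talk to a Docker daemon.

```python
import json


def device_request(count, capabilities, driver=None):
    """Build one DeviceRequest entry using the field names from the notes above."""
    req = {"Count": count, "Capabilities": capabilities}
    if driver is not None:
        req["Driver"] = driver
    return req


def create_container_body(image, device_requests):
    """Wrap device requests in a container-create body (illustrative only)."""
    return {"Image": image, "HostConfig": {"DeviceRequests": device_requests}}


# The simplest valid request: one device with the "gpu" capability.
simplest = device_request(count=1, capabilities=[["gpu"]])

# The probably-better request: all NVIDIA GPUs with compute capability.
better = device_request(
    count=-1,  # -1 requests all GPUs
    capabilities=[["gpu", "nvidia", "compute"]],
    driver="nvidia",
)

# "example/cuda-app" is a placeholder image name.
body = create_container_body("example/cuda-app", [better])
print(json.dumps(body, indent=2))
```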

Docker sets the NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES environment variables.

Driver capabilities and options:

https://github.com/nvidia/nvidia-container-runtime#supported-driver-capabilities

NVIDIA_REQUIRE_CUDA also exists: it takes the SDK version and checks it against the driver version.

CWL support

CUDARequirement:
  minCUDADriverVersion: "10.0"              (required)
  minHardwareCapability: "7.0"              (optional, default null)
  minDeviceCount: 1                         (optional, default 1)
  maxDeviceCount: 1                         (optional, default 1)
  • minCUDADriverVersion: minimum driver version, expressed as a CUDA SDK version (each SDK release mostly corresponds to a new driver revision). We could use the driver revision itself, but that seems more likely to confuse people.
  • minHardwareCapability: minimum NVIDIA hardware architecture
  • min/maxDeviceCount: can request/require multiple devices, if the program supports it.
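
To make the required/optional fields and their defaults concrete, here is a minimal Python sketch that fills in the defaults listed above. The function name normalize_cuda_requirement is hypothetical, and rejecting a maxDeviceCount smaller than minDeviceCount is an assumption; the proposal does not say how that case should be handled.

```python
def normalize_cuda_requirement(req: dict) -> dict:
    """Apply the proposed CUDARequirement defaults and basic sanity checks."""
    if "minCUDADriverVersion" not in req:
        raise ValueError("minCUDADriverVersion is required")
    out = {
        "minCUDADriverVersion": str(req["minCUDADriverVersion"]),
        # Optional, default null: no constraint on the hardware architecture.
        "minHardwareCapability": req.get("minHardwareCapability"),
        # Optional, default 1 for both counts.
        "minDeviceCount": int(req.get("minDeviceCount", 1)),
        "maxDeviceCount": int(req.get("maxDeviceCount", 1)),
    }
    if out["minDeviceCount"] < 1:
        raise ValueError("minDeviceCount must be at least 1")
    # Assumption: an inconsistent range is an error rather than being adjusted.
    if out["maxDeviceCount"] < out["minDeviceCount"]:
        raise ValueError("maxDeviceCount must be >= minDeviceCount")
    return out


print(normalize_cuda_requirement({"minCUDADriverVersion": "10.0"}))
```

Under this strict reading, a requirement that sets only minDeviceCount: 2 is rejected because maxDeviceCount still defaults to 1, which may be an argument for defaulting maxDeviceCount to minDeviceCount instead.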

Arvados support

InstanceTypes:
  instanceWithGPU:
    ...
    CUDA:
      DriverVersion: "11.0"
      HardwareCapability: "9.0"
      DeviceCount: 1

runtime_constraints: {
  cuda_driver_version: "10.0"
  cuda_hardware_capability: "9.0"
  cuda_device_count: 1
}

Instance selection:

Select an instance type that has:

  • InstanceType.CUDA.DriverVersion >= cuda_driver_version
  • InstanceType.CUDA.HardwareCapability >= cuda_hardware_capability (or cuda_hardware_capability is null)
  • InstanceType.CUDA.DeviceCount >= cuda_device_count
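
The selection rule can be sketched in Python as below. This is not the Arvados dispatcher code: the function names are hypothetical, instance types and constraints are modeled as plain dicts using the key names from the snippets above, and instance types without a CUDA section are assumed to be ineligible for GPU requests.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "11.0" into (11, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def instance_satisfies(cuda: dict, constraints: dict) -> bool:
    """Check one instance type's CUDA section against runtime_constraints."""
    if parse_version(cuda["DriverVersion"]) < parse_version(constraints["cuda_driver_version"]):
        return False
    wanted_hw = constraints.get("cuda_hardware_capability")
    # A null hardware capability means the request accepts any hardware.
    if wanted_hw is not None and parse_version(cuda["HardwareCapability"]) < parse_version(wanted_hw):
        return False
    return cuda["DeviceCount"] >= constraints.get("cuda_device_count", 1)


def select_instance_types(instance_types: Dict[str, dict], constraints: dict) -> List[str]:
    """Return the names of all instance types that satisfy the constraints."""
    return [
        name
        for name, spec in instance_types.items()
        if "CUDA" in spec and instance_satisfies(spec["CUDA"], constraints)
    ]


instance_types = {
    "instanceWithGPU": {
        "CUDA": {"DriverVersion": "11.0", "HardwareCapability": "9.0", "DeviceCount": 1}
    },
    "instanceWithoutGPU": {},
}
constraints = {
    "cuda_driver_version": "10.0",
    "cuda_hardware_capability": "9.0",
    "cuda_device_count": 1,
}
print(select_instance_types(instance_types, constraints))
```

Comparing parsed tuples rather than raw strings matters here: as strings, "9.0" would sort after "10.0".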

Update 24 Nov 2021

I revisited the "cuda-compatibility" document linked at the top. The discussion about "cubins" is gone. On some brief research, it appears that while cubins (containing pre-compiled architecture-specific code) are still a thing, a cubin bundle can also include the PTX code. So probably the distinction between hardware capability for cubins and ptx is unnecessary complexity. Edited to simplify hardware capability.


Subtasks: 1 (0 open, 1 closed)

Task #17446: Group review (Resolved, 04/07/2021)

Related issues

Related to Arvados Epics - Idea #15957: GPU support (Resolved, 10/01/2021, 03/31/2022)
Related to Arvados - Feature #18323: CWL support for requesting container with CUDA support (Resolved, Peter Amstutz, 12/20/2021)
Actions #1

Updated by Peter Amstutz over 3 years ago

  • Assigned To set to Peter Amstutz
Actions #2

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2021-01-20 Sprint to 2021-02-03 Sprint
Actions #3

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-02-03 Sprint to 2021-02-17 sprint
Actions #4

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-02-17 sprint to 2021-03-03 sprint
Actions #5

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-03-03 sprint to 2021-03-17 sprint
Actions #16

Updated by Peter Amstutz about 3 years ago

  • Status changed from New to In Progress
Actions #18

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-03-17 sprint to 2021-03-31 sprint
Actions #19

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-03-31 sprint to 2021-04-14 sprint
Actions #20

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-04-14 sprint to 2021-05-26 sprint
Actions #22

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2021-05-26 sprint to 2021-06-09 sprint
Actions #23

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2021-06-09 sprint to 2021-06-23 sprint
Actions #24

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2021-06-23 sprint to 2021-07-07 sprint
Actions #25

Updated by Peter Amstutz almost 3 years ago

  • Target version changed from 2021-07-07 sprint to 2021-07-21 sprint
Actions #26

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (2021-07-21 sprint)
Actions #27

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #28

Updated by Peter Amstutz over 2 years ago

  • Related to Feature #18323: CWL support for requesting container with CUDA support added
Actions #29

Updated by Peter Amstutz over 2 years ago

  • Status changed from In Progress to Resolved