Project

General

Profile

Actions

Feature #18321

closed

Implement resource request and InstanceType selection for CUDA on arvados-dispatch-cloud

Added by Peter Amstutz 8 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
12/14/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Subtasks 1 (0 open1 closed)

Task #18472: review 18321-gpu-instancetypeResolvedPeter Amstutz12/14/2021

Actions

Related issues

Related to Arvados Epics - Story #15957: GPU supportResolved10/01/202103/31/2022

Actions
Actions #1

Updated by Peter Amstutz 8 months ago

Actions #2

Updated by Peter Amstutz 8 months ago

  • Release set to 46
Actions #3

Updated by Peter Amstutz 8 months ago

  • Target version set to 2021-11-24 sprint
Actions #4

Updated by Peter Amstutz 8 months ago

  • Target version changed from 2021-11-24 sprint to 2021-12-08 sprint
Actions #5

Updated by Peter Amstutz 7 months ago

  • Assigned To set to Peter Amstutz
Actions #6

Updated by Peter Amstutz 7 months ago

  • Target version changed from 2021-12-08 sprint to 2022-01-05 sprint
Actions #9

Updated by Peter Amstutz 7 months ago

  • Status changed from New to In Progress
Actions #10

Updated by Tom Clegg 7 months ago

If compareVersion returned a<b instead of a>=b it would work in the same direction as the stdlib sort pkg's "less" funcs. Perhaps even call it "versionLess" or something?

compareVersion returns false for for blanks, meaning an instance type with a missing version will never be selected (which seems reasonable although we might want to warn about "gpu will never be used" in config-check), and a container with Devices>0 but no specified version/hardware will not run at all, meaning you need to specify "0.0" if what you want is "any". It looks like these cases aren't tested -- is the idea that controller/railsapi will enforce non-empty version fields when Devices>0?

It looks like an instance with GPUs will never be selected for a container that doesn't need them, even if it's the cheapest/only option. Suspect you meant to write this? (This situation is probably worth testing.)

-               case it.CUDA.DeviceCount > 0 && !compareVersion(it.CUDA.DriverVersion, ctr.RuntimeConstraints.CUDADriverVersion): // insufficient driver version
-               case it.CUDA.DeviceCount > 0 && !compareVersion(it.CUDA.HardwareCapability, ctr.RuntimeConstraints.CUDAHardwareCapability): // insufficient hardware capability
+               case ctr.RuntimeConstraints.CUDADeviceCount > 0 && !compareVersion(it.CUDA.DriverVersion, ctr.RuntimeConstraints.CUDADriverVersion): // insufficient driver version
+               case ctr.RuntimeConstraints.CUDADeviceCount > 0 && !compareVersion(it.CUDA.HardwareCapability, ctr.RuntimeConstraints.CUDAHardwareCapability): // insufficient hardware capability

We have {CUDADeviceCount: 1, CUDADriverVersion: "11.0"} in runtime constraints, but {CUDA: {DeviceCount: 1, DriverVersion: "11.0"}} in InstanceTypes config. I think it would be more user-friendly to spell both the same way, unless there's a reason to make them different?

The test case has a "small" instance without GPUs but it doesn't seem to have enough memory to be selected even if the GPU requirements are being ignored. Perhaps the test deserves a decoy with more RAM and VCPUs than everything else, and lower cost, but no GPU.

Actions #11

Updated by Peter Amstutz 6 months ago

18321-gpu-instancetype @ cb2d522176c17f2d388098b70fdbaa90fb30e682

developer-run-tests: #2845

  • Refactor RuntimeConstraints, CUDA is in a separate struct to match InstanceType CUDAFeatures
  • Adjust version comparison. Blank versions are treated as wildcards
  • Refactor tests
Actions #12

Updated by Tom Clegg 6 months ago

AFAIK "omitempty" doesn't do anything on a struct field. Is there even a reason to use omitempty on the non-struct fields in RuntimeConstraints? We already use "missing" to mean "I'm not sending the value" in some places and "the value is zero" in other places, so I don't think we should use "omitempty" if we don't need to. It has a history of solving problems now and creating problems later...

Updated test is good. Maybe add an instance type with an wildcard DriverVersion too?

Looks like "12,0" and "12.0.0" will also behave like wildcards, which might be annoying to troubleshoot. Worth doing something in config-loader to warn / fail on anything other than a float or an explicit wildcard string like "*"?

Actions #13

Updated by Peter Amstutz 6 months ago

18321-gpu-instancetype @ 7f88afd565b76903ad4b27fb896ff0cd844dfb7f

  • I changed my mind on wildcard matches and decided to make it strict, so it requires valid versions in both the InstanceType and RuntimeConstraints and will reject it otherwise.
  • Adjusted the test cases accordingly
  • Removing omitempty resulted in test failures between rails and controller, which led me to teaching the Rails API about CUDA
  • Added a container reuse test for CUDA
  • Added a fallback so it can still find and reuse container records prior to the addition of CUDA support

developer-run-tests: #2846

Actions #14

Updated by Tom Clegg 6 months ago

Strict matching seems safer.

Good catch on the reuse problem. I'm wary of doing a count() on the partial-match set, though. There might be some edge cases (newer completed container exists so we don't check for a pre-CUDA one, but the newer one has non-readable log/output, so we don't end up finding anything). Not sure whether any such edge cases actually matter that much. But is it worth considering other options that don't add a db query, like a migration that adds the zero/empty values to existing container records? Data migrations that touch every row are annoying but at least we would not have the extra DB query forever. Or delete empty cuda sections when saving container records and searching for reusable ones? Wouldn't require a migration but seems a bit messy. Or where_serialized could be modified so we can do "matches md5(x) or md5(x_with_empty_cuda_section_removed)"?

The runtime_constraints doc should be updated to indicate the driver/capability fields are no longer optional when device_count>0.

It looks like runtime_constraints.cuda.device_count=0 with non-empty driver/capability fields will be treated as "no GPUs" by everyone until crunch-run, which upgrades device_count to 1 and tries to use GPUs. Perhaps this should be rejected by RailsAPI? And crunch-run should ignore the version fields and go by device_count alone?

It looks like invalid driver/capability fields are undetected until it's time to choose a node type, where we fail "constraints not satisfiable". Seems like the RailsAPI "must be a string" checks could also check that the runtime_constraints fields parse as floats, and lib/config could validate the instance type fields with a (*Loader)checkCUDAVersions() (see checkStorageClasses).

Actions #15

Updated by Peter Amstutz 6 months ago

Tom Clegg wrote:

Or where_serialized could be modified so we can do "matches md5(x) or md5(x_with_empty_cuda_section_removed)"?

Yea, this is a good idea. Done.

The runtime_constraints doc should be updated to indicate the driver/capability fields are no longer optional when device_count>0.

Updated.

It looks like runtime_constraints.cuda.device_count=0 with non-empty driver/capability fields will be treated as "no GPUs" by everyone until crunch-run, which upgrades device_count to 1 and tries to use GPUs. Perhaps this should be rejected by RailsAPI? And crunch-run should ignore the version fields and go by device_count alone?

Brought crunch-run in line with the other code.

It looks like invalid driver/capability fields are undetected until it's time to choose a node type, where we fail "constraints not satisfiable". Seems like the RailsAPI "must be a string" checks could also check that the runtime_constraints fields parse as floats, and lib/config could validate the instance type fields with a (*Loader)checkCUDAVersions() (see checkStorageClasses).

Enhanced check in validate_runtime_constraints & added config check.

18321-gpu-instancetype @ 6fe152024269d838e31bc224adbd518c43cbfee5

developer-run-tests: #2847

Actions #16

Updated by Tom Clegg 6 months ago

Seems like the word "version" could be deleted from the error message "[...] must be a string in format 'X.Y' version"

Everything else LGTM, thanks!

Actions #17

Updated by Peter Amstutz 6 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF