Feature #18321
closedImplement resource request and InstanceType selection for CUDA on arvados-dispatch-cloud
Added by Peter Amstutz about 3 years ago. Updated about 3 years ago.
Updated by Peter Amstutz about 3 years ago
- Related to Idea #15957: GPU support added
Updated by Peter Amstutz about 3 years ago
- Target version set to 2021-11-24 sprint
Updated by Peter Amstutz about 3 years ago
- Target version changed from 2021-11-24 sprint to 2021-12-08 sprint
Updated by Peter Amstutz about 3 years ago
- Target version changed from 2021-12-08 sprint to 2022-01-05 sprint
Updated by Peter Amstutz about 3 years ago
18321-gpu-instancetype @ 0bf88a25aa9f1ce2f4fb9e49504646ebdcaeb633
Updated by Peter Amstutz about 3 years ago
18321-gpu-instancetype @ 23cc75f39c49d20c784fc16cb1a590f023c93559
Updated by Peter Amstutz about 3 years ago
- Status changed from New to In Progress
Updated by Tom Clegg about 3 years ago
If compareVersion returned a<b
instead of a>=b
it would work in the same direction as the stdlib sort pkg's "less" funcs. Perhaps even call it "versionLess" or something?
compareVersion returns false for for blanks, meaning an instance type with a missing version will never be selected (which seems reasonable although we might want to warn about "gpu will never be used" in config-check), and a container with Devices>0 but no specified version/hardware will not run at all, meaning you need to specify "0.0" if what you want is "any". It looks like these cases aren't tested -- is the idea that controller/railsapi will enforce non-empty version fields when Devices>0?
It looks like an instance with GPUs will never be selected for a container that doesn't need them, even if it's the cheapest/only option. Suspect you meant to write this? (This situation is probably worth testing.)
- case it.CUDA.DeviceCount > 0 && !compareVersion(it.CUDA.DriverVersion, ctr.RuntimeConstraints.CUDADriverVersion): // insufficient driver version
- case it.CUDA.DeviceCount > 0 && !compareVersion(it.CUDA.HardwareCapability, ctr.RuntimeConstraints.CUDAHardwareCapability): // insufficient hardware capability
+ case ctr.RuntimeConstraints.CUDADeviceCount > 0 && !compareVersion(it.CUDA.DriverVersion, ctr.RuntimeConstraints.CUDADriverVersion): // insufficient driver version
+ case ctr.RuntimeConstraints.CUDADeviceCount > 0 && !compareVersion(it.CUDA.HardwareCapability, ctr.RuntimeConstraints.CUDAHardwareCapability): // insufficient hardware capability
We have {CUDADeviceCount: 1, CUDADriverVersion: "11.0"} in runtime constraints, but {CUDA: {DeviceCount: 1, DriverVersion: "11.0"}} in InstanceTypes config. I think it would be more user-friendly to spell both the same way, unless there's a reason to make them different?
The test case has a "small" instance without GPUs but it doesn't seem to have enough memory to be selected even if the GPU requirements are being ignored. Perhaps the test deserves a decoy with more RAM and VCPUs than everything else, and lower cost, but no GPU.
Updated by Peter Amstutz about 3 years ago
18321-gpu-instancetype @ cb2d522176c17f2d388098b70fdbaa90fb30e682
- Refactor RuntimeConstraints, CUDA is in a separate struct to match InstanceType CUDAFeatures
- Adjust version comparison. Blank versions are treated as wildcards
- Refactor tests
Updated by Tom Clegg about 3 years ago
AFAIK "omitempty" doesn't do anything on a struct field. Is there even a reason to use omitempty on the non-struct fields in RuntimeConstraints? We already use "missing" to mean "I'm not sending the value" in some places and "the value is zero" in other places, so I don't think we should use "omitempty" if we don't need to. It has a history of solving problems now and creating problems later...
Updated test is good. Maybe add an instance type with an wildcard DriverVersion too?
Looks like "12,0" and "12.0.0" will also behave like wildcards, which might be annoying to troubleshoot. Worth doing something in config-loader to warn / fail on anything other than a float or an explicit wildcard string like "*"?
Updated by Peter Amstutz about 3 years ago
18321-gpu-instancetype @ 7f88afd565b76903ad4b27fb896ff0cd844dfb7f
- I changed my mind on wildcard matches and decided to make it strict, so it requires valid versions in both the InstanceType and RuntimeConstraints and will reject it otherwise.
- Adjusted the test cases accordingly
- Removing omitempty resulted in test failures between rails and controller, which led me to teaching the Rails API about CUDA
- Added a container reuse test for CUDA
- Added a fallback so it can still find and reuse container records prior to the addition of CUDA support
Updated by Tom Clegg about 3 years ago
Strict matching seems safer.
Good catch on the reuse problem. I'm wary of doing a count() on the partial-match set, though. There might be some edge cases (newer completed container exists so we don't check for a pre-CUDA one, but the newer one has non-readable log/output, so we don't end up finding anything). Not sure whether any such edge cases actually matter that much. But is it worth considering other options that don't add a db query, like a migration that adds the zero/empty values to existing container records? Data migrations that touch every row are annoying but at least we would not have the extra DB query forever. Or delete empty cuda sections when saving container records and searching for reusable ones? Wouldn't require a migration but seems a bit messy. Or where_serialized could be modified so we can do "matches md5(x) or md5(x_with_empty_cuda_section_removed)"?
The runtime_constraints doc should be updated to indicate the driver/capability fields are no longer optional when device_count>0.
It looks like runtime_constraints.cuda.device_count=0 with non-empty driver/capability fields will be treated as "no GPUs" by everyone until crunch-run, which upgrades device_count to 1 and tries to use GPUs. Perhaps this should be rejected by RailsAPI? And crunch-run should ignore the version fields and go by device_count alone?
It looks like invalid driver/capability fields are undetected until it's time to choose a node type, where we fail "constraints not satisfiable". Seems like the RailsAPI "must be a string" checks could also check that the runtime_constraints fields parse as floats, and lib/config could validate the instance type fields with a (*Loader)checkCUDAVersions() (see checkStorageClasses).
Updated by Peter Amstutz about 3 years ago
Tom Clegg wrote:
Or where_serialized could be modified so we can do "matches md5(x) or md5(x_with_empty_cuda_section_removed)"?
Yea, this is a good idea. Done.
The runtime_constraints doc should be updated to indicate the driver/capability fields are no longer optional when device_count>0.
Updated.
It looks like runtime_constraints.cuda.device_count=0 with non-empty driver/capability fields will be treated as "no GPUs" by everyone until crunch-run, which upgrades device_count to 1 and tries to use GPUs. Perhaps this should be rejected by RailsAPI? And crunch-run should ignore the version fields and go by device_count alone?
Brought crunch-run in line with the other code.
It looks like invalid driver/capability fields are undetected until it's time to choose a node type, where we fail "constraints not satisfiable". Seems like the RailsAPI "must be a string" checks could also check that the runtime_constraints fields parse as floats, and lib/config could validate the instance type fields with a (*Loader)checkCUDAVersions() (see checkStorageClasses).
Enhanced check in validate_runtime_constraints & added config check.
18321-gpu-instancetype @ 6fe152024269d838e31bc224adbd518c43cbfee5
Updated by Tom Clegg about 3 years ago
Seems like the word "version" could be deleted from the error message "[...] must be a string in format 'X.Y' version"
Everything else LGTM, thanks!
Updated by Peter Amstutz about 3 years ago
- Status changed from In Progress to Resolved