Feature #21926
AMD ROCm GPU support
Added by Peter Amstutz 8 months ago. Updated about 9 hours ago.
Description
docker run -it --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 --group-add=video --network=host --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 16G -v [directory binding options] --name [ollama-blablabla] ollama/ollama:rocm
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html
Files
crunch-run_3.1.0~dev20241218180627-1_amd64.deb (26.8 MB), Peter Amstutz, 12/18/2024 08:12 PM
Updated by Peter Amstutz 6 months ago
- Subject changed from ROCm GPU support to AMD ROCm GPU support
Updated by Peter Amstutz about 2 months ago
It works without --network=host --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 16G.
Those options all reduce security, so if we can make it work with just the following, that's way better:
docker run -it --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 --group-add=video -v [directory binding options] --name [ollama-blablabla] ollama/ollama:rocm
Updated by Peter Amstutz about 2 months ago
- File crunch-run_3.1.0~dev20241218180627-1_amd64.deb crunch-run_3.1.0~dev20241218180627-1_amd64.deb added
Attached package implements prototype ROCm support in crunch-run.
If AMD_VISIBLE_DEVICES is set when crunch-run is executed (you can set AMD_VISIBLE_DEVICES before running crunch-dispatch-local), then crunch-run will make the GPU devices available to the container.
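For illustration, here is a minimal Go sketch of how a Docker-based executor could turn AMD_VISIBLE_DEVICES into device mappings plus the video group, using the Docker Go SDK. This is not the crunch-run implementation; the helper name and the index-to-renderD-node mapping are assumptions, and the device paths mirror the docker run examples above.

package dockergpu

import (
    "fmt"
    "os"
    "strings"

    "github.com/docker/docker/api/types/container"
)

// addAMDDevices exposes ROCm device nodes in a Docker HostConfig when
// AMD_VISIBLE_DEVICES is set (illustrative sketch only).
func addAMDDevices(hostCfg *container.HostConfig) {
    visible := os.Getenv("AMD_VISIBLE_DEVICES")
    if visible == "" {
        return
    }
    // /dev/kfd is the ROCm compute interface shared by all AMD GPUs.
    paths := []string{"/dev/kfd"}
    for _, field := range strings.Split(visible, ",") {
        var idx int
        fmt.Sscanf(strings.TrimSpace(field), "%d", &idx)
        // Assumption: GPU index N maps to card<N> and renderD<128+N>.
        paths = append(paths,
            fmt.Sprintf("/dev/dri/card%d", idx),
            fmt.Sprintf("/dev/dri/renderD%d", 128+idx))
    }
    for _, p := range paths {
        hostCfg.Devices = append(hostCfg.Devices, container.DeviceMapping{
            PathOnHost:        p,
            PathInContainer:   p,
            CgroupPermissions: "rwm",
        })
    }
    // Equivalent of docker run --group-add=video.
    hostCfg.GroupAdd = append(hostCfg.GroupAdd, "video")
}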
Updated by Peter Amstutz 9 days ago
- Target version changed from Future to Development 2025-01-29
Updated by Peter Amstutz 9 days ago
- Status changed from New to In Progress
- Tracker changed from Idea to Feature
Updated by Peter Amstutz 9 days ago
- Target version changed from Development 2025-01-29 to Development 2025-02-12
Updated by Peter Amstutz 3 days ago
15:21:12 Failures (6):
15:21:12 Fail: services/crunch-dispatch-local install (0s)
15:21:12 Fail: gofmt tests (18s)
15:21:12 Fail: lib/controller tests (69s)
15:21:12 Fail: lib/crunchrun tests (0s)
15:21:12 Fail: lib/lsf tests (22s)
15:21:12 Fail: services/crunch-dispatch-local tests (1s)
15:20:25 Failures (1):
15:20:25 Fail: services/crunch-dispatch-local install (1s)
Updated by Peter Amstutz about 16 hours ago
21926-rocm @ 59f1e8f417b0109ab5948370cfe84a225908489c
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
  - Can now request AMD GPUs; they are supported throughout the stack.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - Singularity executor does not yet support ROCm: https://dev.arvados.org/issues/22550
- Code is tested and passing, both automated and manual, what manual testing was done is described.
  - Added tests for deprecated config key update
  - Fixed tests to ignore the deprecated runtime_constraints "cuda" key
  - Updated crunch-run tests for generic GPU support
  - Updated node size tests for generic GPU support
  - Updated LSF tests for generic GPU support
  - Updated CWL tests to check that GPU requirements are translated to proper container requests
  - Added API server tests covering the new API and backwards compatibility with the old "cuda" API
  - Updated crunch-dispatch-local test to account for the resource usage management feature
  - For manual testing, I built packages, installed them on the prototype Arvados appliance, and confirmed that I could request and run containers with ROCm support. I still need to do a bit more of this using the latest packages.
- New or changed UX/UI and has gotten feedback from stakeholders.
  - n/a
- Documentation has been updated.
  - yes
- Behaves appropriately at the intended scale (describe intended scale).
  - does not affect scale
- Considered backwards and forwards compatibility issues between client and server.
  - The previous way of requesting CUDA GPUs is transparently migrated to the new GPU model (a rough sketch of this translation follows the checklist)
  - Deprecated config keys are migrated with a warning
- Follows our coding standards and GUI style guidelines.
  - yes
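To make the compatibility notes above concrete, here is a rough Go sketch of the kind of translation involved in migrating a legacy "cuda" request to the generic, stack-parameterized model. The legacy field names match the long-standing runtime_constraints "cuda" key; apart from "stack", the fields on the new struct are illustrative assumptions, not necessarily the exact shipped API.

package gpuconstraints

// CUDARuntimeConstraints mirrors the deprecated runtime_constraints "cuda" key.
type CUDARuntimeConstraints struct {
    DeviceCount        int    `json:"device_count"`
    DriverVersion      string `json:"driver_version"`
    HardwareCapability string `json:"hardware_capability"`
}

// GPURuntimeConstraints is a generic GPU request parameterized on the GPU
// stack ("cuda" or "rocm"). Field names other than Stack are assumptions.
type GPURuntimeConstraints struct {
    Stack          string   `json:"stack"`
    DeviceCount    int      `json:"device_count"`
    DriverVersion  string   `json:"driver_version"`
    HardwareTarget []string `json:"hardware_target"`
}

// migrateCUDA translates a legacy CUDA request into the generic model so
// that old clients keep working against the new API.
func migrateCUDA(old CUDARuntimeConstraints) GPURuntimeConstraints {
    return GPURuntimeConstraints{
        Stack:          "cuda",
        DeviceCount:    old.DeviceCount,
        DriverVersion:  old.DriverVersion,
        HardwareTarget: []string{old.HardwareCapability},
    }
}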
This branch does 4 main things:
1. Migrate GPU support from being CUDA-specific to supporting generic GPUs, parameterized on "stack". This touched a number of components and makes for a large set of changes. It would have been nice to avoid that, but when I implemented GPU support originally I intentionally kept it narrowly scoped because I didn't know how to generalize it in the future. It is now the future, and I have generalized it.
2. Add support for mounting AMD GPU devices into the container in crunch-run's docker executor. I did not add support for Singularity (#22550); Singularity still supports CUDA.
3. Migrate arvados-cwl-runner to use the new API and add a new ROCmRequirement.
4. Add basic resource management to crunch-dispatch-local. This was really essential for this branch, because if you let two different processes access the GPU at once, you can have a bad time.
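The resource management in item 4 is essentially admission control: don't start a GPU container until a GPU slot is free. A minimal Go sketch of that idea follows; it is not the actual crunch-dispatch-local code, and the names are made up for illustration.

package localdispatch

// gpuSlots is a counting semaphore sized to the number of GPUs on the node.
// Acquiring a slot before starting a GPU container keeps two containers from
// contending for the same GPU.
type gpuSlots chan struct{}

func newGPUSlots(n int) gpuSlots {
    s := make(gpuSlots, n)
    for i := 0; i < n; i++ {
        s <- struct{}{}
    }
    return s
}

func (s gpuSlots) acquire() { <-s }
func (s gpuSlots) release() { s <- struct{}{} }

// runContainer gates GPU containers behind the semaphore; needsGPU and
// start stand in for the dispatcher's real scheduling logic.
func runContainer(slots gpuSlots, needsGPU bool, start func()) {
    if needsGPU {
        slots.acquire()
        defer slots.release()
    }
    start()
}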
Updated by Peter Amstutz about 9 hours ago
- Related to Feature #22550: Singularity executor supports ROCm added