Feature #21926 (closed): AMD ROCm GPU support

Added by Peter Amstutz 10 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: Crunch
Target version: Development 2025-02-26
Story points: -
Release relationship: Auto

Description

docker run -it --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 --group-add=video --network=host --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 16G -v [directory binding options] --name [ollama-blablabla] ollama/ollama:rocm

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html


Subtasks 1 (0 open, 1 closed)

Task #22498: Review 21926-rocm (Resolved, Tom Clegg, 02/15/2025)

Related issues 4 (0 open, 4 closed)

Related to Arvados - Feature #22550: Singularity executor supports ROCm (Resolved)
Related to Arvados - Feature #22568: Update workbench2 to properly display "GPU" runtime constraint (Resolved, Peter Amstutz)
Related to Arvados - Feature #22563: compute node ansible playbook to install ROCm (Resolved, Brett Smith, 02/17/2025)
Related to Arvados - Feature #22314: Resource accounting in crunch-dispatch-local (Resolved, Peter Amstutz)
Actions #1

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 9 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 8 months ago

  • Subject changed from ROCm GPU support to AMD ROCm GPU support
Actions #5

Updated by Peter Amstutz 4 months ago

It works without --network=host --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 16G

Those options all reduce security, so if we can make it work with just this, that's way better:

docker run -it --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 --group-add=video -v [directory binding options] --name [ollama-blablabla] ollama/ollama:rocm

Actions #6

Updated by Peter Amstutz 4 months ago

The attached package implements prototype ROCm support in crunch-run.

If AMD_VISIBLE_DEVICES is set when crunch-run is executed (you can set AMD_VISIBLE_DEVICES before running crunch-dispatch-local), then crunch-run will make the GPU devices available to the container.
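For illustration, a minimal sketch of the kind of check described above: if AMD_VISIBLE_DEVICES is set, expose the AMD device nodes to the container. The function and its logic are assumptions for this sketch, not the actual crunch-run code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// amdDevices returns the device paths to expose to the container when
// AMD_VISIBLE_DEVICES is set in crunch-run's environment. Illustrative
// sketch only, not the actual crunch-run implementation.
func amdDevices() []string {
	if os.Getenv("AMD_VISIBLE_DEVICES") == "" {
		return nil
	}
	// ROCm containers need the compute interface plus the DRI render nodes.
	devices := []string{"/dev/kfd"}
	if renderNodes, err := filepath.Glob("/dev/dri/renderD*"); err == nil {
		devices = append(devices, renderNodes...)
	}
	return devices
}

func main() {
	fmt.Println(amdDevices())
}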

Actions #7

Updated by Peter Amstutz 2 months ago

  • Target version changed from Future to Development 2025-01-29
Actions #8

Updated by Peter Amstutz 2 months ago

  • Status changed from New to In Progress
  • Tracker changed from Idea to Feature
Actions #9

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12
Actions #10

Updated by Peter Amstutz 2 months ago

  • Release set to 75
Actions #11

Updated by Peter Amstutz 2 months ago

  • Assigned To set to Peter Amstutz
Actions #12

Updated by Peter Amstutz 2 months ago

  • Subtask #22498 added
Actions #14

Updated by Peter Amstutz 2 months ago

15:21:12 Failures (6):
15:21:12 Fail: services/crunch-dispatch-local install (0s)
15:21:12 Fail: gofmt tests (18s)
15:21:12 Fail: lib/controller tests (69s)
15:21:12 Fail: lib/crunchrun tests (0s)
15:21:12 Fail: lib/lsf tests (22s)
15:21:12 Fail: services/crunch-dispatch-local tests (1s)

15:20:25 Failures (1):
15:20:25 Fail: services/crunch-dispatch-local install (1s)

Actions #17

Updated by Peter Amstutz about 2 months ago

21926-rocm @ 59f1e8f417b0109ab5948370cfe84a225908489c

developer-run-tests: #4650

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Can now request AMD GPUs and they are supported throughout the stack
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • Added tests for deprecated config key update
    • Fixed tests to ignore deprecated runtime_constraints "cuda" key
    • Updated crunch-run tests for generic GPU support
    • Updated node size tests for generic GPU support
    • Updated LSF tests for generic GPU support
    • Updated CWL tests to check that GPU requirements are translated to proper container requests
    • Added API server tests to test new API and backwards compatibility with the old 'cuda' API
    • Updated crunch-dispatch-local test to account for resource usage management feature
    • For manual testing, I built packages and installed them on the prototype arvados appliance and confirmed that I could request and run containers with ROCm support. I still need to do a bit more of this using the latest packages.
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • yes
  • Behaves appropriately at the intended scale (describe intended scale).
    • does not affect scale
  • Considered backwards and forwards compatibility issues between client and server.
    • The previous way of requesting CUDA GPUs is transparently migrated to the new GPU model
    • deprecated config keys are migrated with a warning
  • Follows our coding standards and GUI style guidelines.
    • yes

This branch does 4 main things:

1. Migrate GPU support from being specialized to CUDA to supporting generic GPUs, parameterized on "stack" (a sketch of the new constraint shape follows after this list). This touched a number of components and makes for a large set of changes; of course it would have been nice if we didn't have to do it, but when I originally implemented GPU support I intentionally kept it narrowly scoped because I didn't know how to generalize it in the future. It is now the future, and I have generalized it.

2. Add support for mounting AMD GPU devices into the container in crunch-run's docker executor. I did not add support for singularity (#22550). Singularity still supports CUDA.

3. Migrate arvados-cwl-runner to use the new API and add a new ROCmRequirement.

4. Add basic resource management to crunch-dispatch-local. This was really essential for this branch, because if you let two different processes access the GPU at once, you can have a bad time.
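
To make the "parameterized on stack" point concrete, here is a sketch of the new generic GPU runtime constraint. The struct and field names are illustrative rather than the real SDK definitions; the JSON keys follow the controller test output quoted later in this ticket (note-27), and the units and example values are assumptions.

package main

import (
	"encoding/json"
	"fmt"
)

// GPURuntimeConstraints sketches the generic GPU section of
// runtime_constraints. JSON keys match the test output in note-27;
// the Go type name and units are assumptions for this sketch.
type GPURuntimeConstraints struct {
	Stack          string   `json:"stack"`           // "cuda" or "rocm"
	DeviceCount    int      `json:"device_count"`
	DriverVersion  string   `json:"driver_version"`
	HardwareTarget []string `json:"hardware_target"` // e.g. ROCm gfx targets or CUDA compute capabilities
	VRAM           int64    `json:"vram"`            // minimum GPU memory (assumed to be bytes)
}

func main() {
	req := GPURuntimeConstraints{
		Stack:          "rocm",
		DeviceCount:    1,
		DriverVersion:  "6.2",
		HardwareTarget: []string{"gfx1100"},
		VRAM:           16 << 30,
	}
	buf, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(buf))
}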

Actions #18

Updated by Peter Amstutz about 2 months ago

Actions #19

Updated by Peter Amstutz about 2 months ago

21926-rocm @ 8a501f88992b041e8b7f8606af3095020d595b69

developer-run-tests: #4651

Resolved a few issues that came up in manual testing; it is working in manual testing now.

Actions #20

Updated by Tom Clegg about 2 months ago

Singularity docs make it sound like adding ROCm support is as easy as args = append(args, "--rocm"). Is there a reason not to do this? (Maybe just lack of testing?)

As long as singularity + rocm is guaranteed to fail, that should be mentioned in the install/upgrade docs, and we should fail affected containers early -- maybe in source:lib/dispatchcloud/node_size.go and perhaps even in Rails when creating/committing a CR -- since bringing up a GPU node only to write a "refusing to even try this" error message seems unfortunate.

In source:doc/install/crunch2-cloud/install-dispatch-cloud.html.textile.liquid, the descriptions of DriverVersion, HardwareTarget, etc., should be consistent -- either always begin with "The" (like DriverVersion does) or never begin with "The" (like VRAM). I prefer the latter.

In source:doc/install/crunch2-cloud/install-dispatch-cloud.html.textile.liquid, there are a few tab-indents mixed in with space-indents and they mess with indentation in diffs. Style guide says spaces only.

Hints about hardware targets (rocminfo, link to AMD docs) are written slightly differently in two places. Perhaps source:doc/_includes/_container_runtime_constraints.liquid should be phrased in terms of matching the cluster config (which it actually has to match), rather than the actual hardware?

The phrase "automatic migration" might be misleading. Elsewhere in Arvados, migrations modify your database, so this sounds like we're going to modify your config file. Maybe "the old/deprecated config will still be accepted"? Previously we have written:

The following configuration keys have been renamed or removed. Renamed keys will still be loaded if they appear with their old names, but you should update your /etc/arvados/config.yml file to avoid warnings when services start up.

config.default.yml comments should wrap at 70 cols.

config.default.yml comments should be right above the keys they're describing, instead of "here's what various keys mean in the following section".

Can we remove the CUDA field from the RuntimeConstraints struct in Go? I'm thinking:
  • when an old client sends a create/update request with a CUDA section, that gets passed through without being coerced to a ContainerRequest, so I don't think we need it for requests
  • a rails response never has a CUDA section, so it seems like the SDK field just causes controller to add an empty CUDA section to responses, which doesn't seem useful

source:lib/lsf/dispatch_test.go should use arvadostest.ResetDB(c) (oops, I see we have several copies of this code left to clean up)

Submitting a container with an unknown/misspelled stack like "ROCm" or "CUDA" should result in a "no suitable node type" error in source:lib/dispatchcloud/node_size.go, and perhaps a container request create/update request with an unknown stack should be rejected up front by Rails -- it looks like the current code just uses a non-GPU node in such cases, which doesn't sound great.

I think we should test the API behavior we're expecting for old clients, maybe in source:lib/controller/integration_test.go -- maybe, create a CR using the old CUDA spelling, confirm the response has the new GPU spelling, create two more CRs with CUDA and GPU spellings, check the same container gets assigned to all three.

Actions #21

Updated by Peter Amstutz about 2 months ago

Tom Clegg wrote in #note-20:

> Singularity docs make it sound like adding ROCm support is as easy as args = append(args, "--rocm"). Is there a reason not to do this? (Maybe just lack of testing?)

There is no reason not to do this, so I did it. The lame reason is that I just didn't take the time to look up the documentation.
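
For reference, a minimal sketch of that change in the Singularity executor. The function and the gpuStack parameter are illustrative, not the real executor code; --nv and --rocm are Singularity's actual flags for NVIDIA and AMD GPU support.

package main

import "fmt"

// buildGPUArgs sketches how the singularity command line could enable GPU
// support based on the requested stack.
func buildGPUArgs(gpuStack string) []string {
	args := []string{"singularity", "exec", "--containall", "--cleanenv"}
	switch gpuStack {
	case "cuda":
		args = append(args, "--nv")
	case "rocm":
		args = append(args, "--rocm")
	}
	return args
}

func main() {
	fmt.Println(buildGPUArgs("rocm"))
}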

> In source:doc/install/crunch2-cloud/install-dispatch-cloud.html.textile.liquid, the descriptions of DriverVersion, HardwareTarget, etc., should be consistent -- either always begin with "The" (like DriverVersion does) or never begin with "The" (like VRAM). I prefer the latter.

> In source:doc/install/crunch2-cloud/install-dispatch-cloud.html.textile.liquid, there are a few tab-indents mixed in with space-indents and they mess with indentation in diffs. Style guide says spaces only.

Fixed.

> Hints about hardware targets (rocminfo, link to AMD docs) are written slightly differently in two places. Perhaps source:doc/_includes/_container_runtime_constraints.liquid should be phrased in terms of matching the cluster config (which it actually has to match), rather than the actual hardware?

Now says "To be scheduled, at least one item in this list must match the HardwareTarget of one of the cluster's InstanceTypes"

The phrase "automatic migration" might be misleading. Elsewhere in Arvados, migrations modify your database, so this sounds like we're going to modify your config file. Maybe "the old/deprecated config will still be accepted"?

Now says "To minimize disruption, the config loader will continue to accept the deprecated CUDA field and a emit warning. Admins are advised to update the configuration file as the legacy field will be removed in a future version."

> config.default.yml comments should wrap at 70 cols.

> config.default.yml comments should be right above the keys they're describing, instead of "here's what various keys mean in the following section".

Fixed.

> Can we remove the CUDA field from the RuntimeConstraints struct in Go? I'm thinking:
>   • when an old client sends a create/update request with a CUDA section, that gets passed through without being coerced to a ContainerRequest, so I don't think we need it for requests
>   • a rails response never has a CUDA section, so it seems like the SDK field just causes controller to add an empty CUDA section to responses, which doesn't seem useful

I don't think so. I remember talking about it. At minimum, it appears to transit through the Go ContainerRequest struct in the federation case. I think we also discussed "what if the client sent a 'cuda' field and was surprised when it didn't get one back". Right now what should happen is that if a 'cuda' field was provided, it returns both the original 'cuda' field and the translated 'gpu' one.

> source:lib/lsf/dispatch_test.go should use arvadostest.ResetDB(c) (oops, I see we have several copies of this code left to clean up)

Fixed.

> Submitting a container with an unknown/misspelled stack like "ROCm" or "CUDA" should result in a "no suitable node type" error in source:lib/dispatchcloud/node_size.go, and perhaps a container request create/update request with an unknown stack should be rejected up front by Rails -- it looks like the current code just uses a non-GPU node in such cases, which doesn't sound great.

Added it to the validation.

> I think we should test the API behavior we're expecting for old clients, maybe in source:lib/controller/integration_test.go -- maybe, create a CR using the old CUDA spelling, confirm the response has the new GPU spelling, create two more CRs with CUDA and GPU spellings, check the same container gets assigned to all three.

The new tests in source:services/api/test/unit/container_test.rb cover most of this; are you suggesting an integration test would be better?

21926-rocm @ 4227434c76ce326b3626a3031705e1c68b0beea4

developer-run-tests: #4658

Actions #22

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2025-02-12 to Development 2025-02-26
Actions #23

Updated by Peter Amstutz about 2 months ago

  • Related to Feature #22568: Update workbench2 to properly display "GPU" runtime constraint added
Actions #24

Updated by Peter Amstutz about 2 months ago

  • Related to Feature #22563: compute node ansible playbook to install ROCm added
Actions #27

Updated by Tom Clegg about 2 months ago

Peter Amstutz wrote in #note-21:

>> Can we remove the CUDA field from the RuntimeConstraints struct in Go? I'm thinking:
>>   • when an old client sends a create/update request with a CUDA section, that gets passed through without being coerced to a ContainerRequest, so I don't think we need it for requests
>>   • a rails response never has a CUDA section, so it seems like the SDK field just causes controller to add an empty CUDA section to responses, which doesn't seem useful

> I don't think so. I remember talking about it. At minimum, it appears to transit through the Go ContainerRequest struct in the federation case. I think we also discussed "what if the client sent a 'cuda' field and was surprised when it didn't get one back". Right now what should happen is that if a 'cuda' field was provided, it returns both the original 'cuda' field and the translated 'gpu' one.

The RuntimeConstraints struct has both 'cuda' and 'gpu', so both will be in every container_request in a response. If Rails didn't provide values for 'cuda', it will have zero values. This was caught by HandlerSuite.TestGetObjects in lib/controller before you added an exemption for it:

handler_test.go:651:
    c.Check(val, check.DeepEquals, direct[k],
        check.Commentf("RailsAPI %s key %q's value %q differs from controller's %q.", direct["kind"], k, direct[k], val))
... obtained map[string]interface {} = map[string]interface {}{"API":false, "cuda":map[string]interface {}{"device_count":0, "driver_version":"", "hardware_capability":""}, "gpu":map[string]interface {}{"device_count":0, "driver_version":"", "hardware_target":[]interface {}{}, "stack":"", "vram":0}, "keep_cache_disk":0, "keep_cache_ram":0, "ram":123, "vcpus":1}
... expected map[string]interface {} = map[string]interface {}{"API":false, "gpu":map[string]interface {}{"device_count":0, "driver_version":"", "hardware_target":[]interface {}{}, "stack":"", "vram":0}, "keep_cache_disk":0, "keep_cache_ram":0, "ram":123, "vcpus":1}
... RailsAPI arvados#containerRequest key "runtime_constraints"'s value map["API":%!q(bool=false) "gpu":map["device_count":%!q(float64=0) "driver_version":"" "hardware_target":[] "stack":"" "vram":%!q(float64=0)] "keep_cache_disk":%!q(float64=0) "keep_cache_ram":%!q(float64=0) "ram":%!q(float64=123) "vcpus":%!q(float64=1)] differs from controller's map["API":%!q(bool=false) "cuda":map["device_count":%!q(float64=0) "driver_version":"" "hardware_capability":""] "gpu":map["device_count":%!q(float64=0) "driver_version":"" "hardware_target":[] "stack":"" "vram":%!q(float64=0)] "keep_cache_disk":%!q(float64=0) "keep_cache_ram":%!q(float64=0) "ram":%!q(float64=123) "vcpus":%!q(float64=1)].
... Difference:
...     ["cuda"]: map["device_count":%!q(float64=0) "driver_version":"" "hardware_capability":""] != (missing)

I think that is the wrong behavior and we should fix it rather than add an exemption to the test case to allow it.

I was expecting we would remove the CUDA section from the RuntimeConstraints struct, and translate old-style to new-style on the fly in Rails.

As long as we just report whatever went into the database at the time the CR was created, controller and all clients that use/display this information for existing CRs will need to maintain their own code to handle both styles.

Say an old client creates/updates a CR, and the CUDA section is missing in the response. Is this a real problem for a client we know about?
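
A rough sketch of that on-the-fly translation, written in Go here purely for illustration (the real code would live in the Rails API server); how the individual legacy fields map onto the new shape is my guess for this sketch:

package main

import "fmt"

// cudaToGPU converts a legacy "cuda" runtime constraint into the new
// generic "gpu" form. The field mapping is an assumption for this sketch.
func cudaToGPU(cuda map[string]interface{}) map[string]interface{} {
	return map[string]interface{}{
		"stack":          "cuda",
		"device_count":   cuda["device_count"],
		"driver_version": cuda["driver_version"],
		// The single legacy hardware_capability becomes a one-element
		// hardware_target list.
		"hardware_target": []interface{}{cuda["hardware_capability"]},
		// The legacy form had no VRAM field; 0 stands for "unspecified"
		// in this sketch.
		"vram": 0,
	}
}

func main() {
	legacy := map[string]interface{}{
		"device_count":        1,
		"driver_version":      "11.0",
		"hardware_capability": "8.0",
	}
	fmt.Println(cudaToGPU(legacy))
}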

>> I think we should test the API behavior we're expecting for old clients, maybe in source:lib/controller/integration_test.go -- maybe, create a CR using the old CUDA spelling, confirm the response has the new GPU spelling, create two more CRs with CUDA and GPU spellings, check the same container gets assigned to all three.

> The new tests in source:services/api/test/unit/container_test.rb cover most of this; are you suggesting an integration test would be better?

Yes, I'm specifically looking to make sure that an old client can actually create a CR with GPU requirements, that the resulting container has GPU requirements, and that the response looks the way we want it to. This should go through controller, because controller writes the relevant response.

(Other fixes above look good, thanks)

Actions #28

Updated by Peter Amstutz about 2 months ago

21926-rocm @ 72cdecb7a572de9d853480732ecaeba1aaa63eae

developer-run-tests: #4666

Removed CUDA from the RuntimeConstraints struct, and added an integration test confirming that CUDA constraints are still passed through to the API server. Found and fixed a bug in the API server translation that meant it wasn't saving the 'gpu' runtime_constraint properly.

Actions #29

Updated by Tom Clegg about 2 months ago

Just a nit, this might as well go all the way back to nil like its neighbors:

                "containers/" + arvadostest.RunningContainerUUID:               nil,
-               "container_requests/" + arvadostest.QueuedContainerRequestUUID: {"runtime_constraints": "cuda"},
+               "container_requests/" + arvadostest.QueuedContainerRequestUUID: {},
                "groups/" + arvadostest.AProjectUUID:                           nil,

For that matter, we could revert all the changes to source:lib/controller/handler_test.go on this branch.

IDK what the test failures are about. Perhaps more related to Jenkins image than this branch?

Assuming so, LGTM.

Actions #30

Updated by Brett Smith about 2 months ago

Tom Clegg wrote in #note-29:

> IDK what the test failures are about. Perhaps more related to Jenkins image than this branch?

I don't see any evidence of that. Remember, Jenkins is still running test jobs with the same image it's been using for a month.

I think this error is most concerning:

======= test sdk/go/keepclient
child.pid is 99782
child.pid is 99797
Sent SIGTERM to 99782 (/tmp/workspace/run-tests-remainder/tmp/keep0.pid)
Sent SIGTERM to 99797 (/tmp/workspace/run-tests-remainder/tmp/keep1.pid)
2025/02/14 17:14:18 http: panic serving 127.0.0.1:53528: runtime error: invalid memory address or nil pointer dereference
goroutine 47123 [running]:
net/http.(*conn).serve.func1()
    /home/jenkins/tmp/GOPATH/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.4.linux-amd64/src/net/http/server.go:1898 +0xbe
panic({0xa9e340?, 0x10ca6a0?})
    /home/jenkins/tmp/GOPATH/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.4.linux-amd64/src/runtime/panic.go:770 +0x132
git.arvados.org/arvados.git/sdk/go/keepclient.(*FailThenSucceedHandler).ServeHTTP(0xc002b2fa00, {0xc69860, 0xc001ad68c0}, 0xc0087aa000)
    /tmp/workspace/run-tests-remainder/sdk/go/keepclient/keepclient_test.go:437 +0xcf
net/http.serverHandler.ServeHTTP({0xc67630?}, {0xc69860?, 0xc001ad68c0?}, 0x6?)
    /home/jenkins/tmp/GOPATH/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.4.linux-amd64/src/net/http/server.go:3137 +0x8e
net/http.(*conn).serve(0xc004537950, {0xc6b658, 0xc00462b0e0})
    /home/jenkins/tmp/GOPATH/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.4.linux-amd64/src/net/http/server.go:2039 +0x5e8
created by net/http.(*Server).Serve in goroutine 47056
    /home/jenkins/tmp/GOPATH/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.4.linux-amd64/src/net/http/server.go:3285 +0x4b4

----------------------------------------------------------------------
FAIL: keepclient_test.go:1512: StandaloneSuite.TestPutBRetry

=== initial delay 0s
keepclient_test.go:1557:
    c.Check(err, IsNil)
... value keepclient.InsufficientReplicasError = keepclient.InsufficientReplicasError{error:(*errors.errorString)(0xc003bd0810)} ("Could not write sufficient replicas: [500] 500 Internal Server Error")

keepclient_test.go:1559:
    c.Check(replicas, Equals, 2)
... obtained int = 1
... expected int = 2

=== initial delay 1ns
=== initial delay 125ms
=== initial delay 62.5ms
OOPS: 53 passed, 1 FAILED
--- FAIL: Test (6.38s)
FAIL
coverage: 85.8% of statements
FAIL    git.arvados.org/arvados.git/sdk/go/keepclient    6.399s
FAIL
======= sdk/go/keepclient tests -- FAILED

If keepstore is panicking while serving blocks, that could explain the "can't mmap empty file" errors that arv-mount reported throughout the lib/crunchrun tests: if arv-mount just writes the response to a local cache file, then tries to mmap it without checking that it got a good response (I hope not, but…)

I wonder if we maybe actually do have a go1.22 upgrade issue?

Actions #31

Updated by Brett Smith about 2 months ago

I should add: we did see another Jenkins job fail because the node ran out of space. build-packages-ubuntu2004: #2059

I don't think this is a problem with the image per se, but because we have more jobs using the same image, we're getting more node reuse than we have in the past, and it looks like a node might be able to accumulate enough cruft to fill our current disk allocation.

That's just a number in Jenkins, so it's easy to change. But at the same time, I'm also of the opinion that keepstore shouldn't panic. Is it possible that it does in an ENOSPC situation? If so, we should fix that too.

Actions #33

Updated by Tom Clegg about 2 months ago

Yeah, this is a mystery to me. Thoughts:
  • This isn't keepstore crashing, it's a stub (FailThenSucceedHandler wrapping StubPutHandler)
  • I haven't been able to reproduce this locally
  • I don't see the connection to the changes in this branch
  • I don't see a pointer that can be nil at source:sdk/go/keepclient/keepclient_test.go#L437 with or without a full disk
    fh.reqIDs = append(fh.reqIDs, req.Header.Get("X-Request-Id"))
  • The same test passed on go 1.22 and go 1.23
Actions #34

Updated by Peter Amstutz about 2 months ago

  • Status changed from In Progress to Resolved
Actions #35

Updated by Peter Amstutz about 2 months ago

  • Related to Feature #22314: Resource accounting in crunch-dispatch-local added
Actions #36

Updated by Peter Amstutz about 2 months ago

(pretty sure it was an out of disk space error, since the subsequent re-run passed)

This is merged now.
