Feature #22314
Resource accounting in crunch-dispatch-local (open)
Updated by Peter Amstutz about 1 month ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 1 month ago
- Assigned To set to Peter Amstutz
- Status changed from In Progress to New
- Category set to Crunch
- Tracker changed from Bug to Feature
Updated by Peter Amstutz about 1 month ago
22314-dispatch-rsc @ 5b0e5a375ce34ea60ffb5543e2c6bae03fc4b126
Needs some tests. It also accounts for GPUs but doesn't explicitly allocate them, which may be necessary: something needs to know which GPU devices are busy and hand out a free one.
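To illustrate what explicit GPU allocation could look like, here is a minimal sketch of a tracker that records which device indexes are busy and hands out free ones. It is not taken from the branch: `gpuPool`, `acquire`, `release`, and `errNoGPU` are hypothetical names, and how the chosen indexes get passed to the container runtime is left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// gpuPool is a hypothetical tracker of which GPU device indexes are busy.
// It is an illustration only, not code from the 22314-dispatch-rsc branch.
type gpuPool struct {
	mtx  sync.Mutex
	busy []bool // busy[i] is true while device i is assigned to a container
}

var errNoGPU = errors.New("not enough free GPU devices")

func newGPUPool(n int) *gpuPool {
	return &gpuPool{busy: make([]bool, n)}
}

// acquire reserves count free devices and returns their indexes.
// If fewer than count devices are free, it reserves nothing and
// returns errNoGPU.
func (p *gpuPool) acquire(count int) ([]int, error) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	var picked []int
	for i, b := range p.busy {
		if len(picked) == count {
			break
		}
		if !b {
			picked = append(picked, i)
		}
	}
	if len(picked) < count {
		return nil, errNoGPU
	}
	for _, i := range picked {
		p.busy[i] = true
	}
	return picked, nil
}

// release marks the given devices as free again.
func (p *gpuPool) release(devs []int) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for _, i := range devs {
		p.busy[i] = false
	}
}

func main() {
	pool := newGPUPool(2)
	a, _ := pool.acquire(1)
	b, _ := pool.acquire(1)
	_, err := pool.acquire(1) // both devices are busy at this point
	fmt.Println(a, b, err)
	pool.release(a)
	c, _ := pool.acquire(1) // the released device can be reused
	fmt.Println(c)
}
```

The all-or-nothing behavior in `acquire` is a design choice for the sketch: a container that asks for two devices never ends up holding one while waiting for the other.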
Updated by Tom Clegg about 1 month ago
- Related to Feature #14922: Run multiple containers concurrently on a single cloud VM added
Updated by Peter Amstutz 30 days ago
Tom Clegg wrote in #note-6:
I think it would be helpful to mention the problems or use-cases this is meant to address.
The current implementation has a fixed concurrency of 8 processes, with no logic either to scale that limit to the resources of the host or to account for the resource requests of the containers. This makes it unsuitable for real work.
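As a rough sketch of the alternative to a fixed process count, the dispatcher could track the host's free capacity and start a container only when its request fits. This is illustrative and not the code in the branch: the `resources` and `accountant` types are hypothetical, the RAM figure in `main` is a made-up placeholder, and a real dispatcher would also need locking around the bookkeeping.

```go
package main

import (
	"fmt"
	"runtime"
)

// resources is a hypothetical summary of what a host offers or what a
// container requests. The field names are assumptions for this sketch.
type resources struct {
	vcpus int
	ram   int64 // bytes
	gpus  int
}

// accountant tracks the capacity that is still free on the host.
// It is not goroutine-safe; a real dispatcher would guard it with a mutex.
type accountant struct {
	free resources
}

// tryReserve subtracts req from the free capacity and returns true if
// the whole request fits; otherwise it changes nothing and returns false.
func (a *accountant) tryReserve(req resources) bool {
	if req.vcpus > a.free.vcpus || req.ram > a.free.ram || req.gpus > a.free.gpus {
		return false
	}
	a.free.vcpus -= req.vcpus
	a.free.ram -= req.ram
	a.free.gpus -= req.gpus
	return true
}

// release returns a finished container's reservation to the free pool.
func (a *accountant) release(req resources) {
	a.free.vcpus += req.vcpus
	a.free.ram += req.ram
	a.free.gpus += req.gpus
}

func main() {
	// Capacity is derived from the host instead of a fixed count of 8.
	// The RAM figure is a placeholder; real code would query the OS.
	acct := &accountant{free: resources{vcpus: runtime.NumCPU(), ram: 16 << 30}}

	req := resources{vcpus: 1, ram: 2 << 30}
	started := 0
	for acct.tryReserve(req) {
		started++ // a real dispatcher would start the container here
	}
	fmt.Printf("started %d containers, remaining capacity: %+v\n", started, acct.free)
}
```

With this shape, concurrency falls out of the requests themselves: many small containers can run side by side, while a single large one may occupy the whole host.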
I was discussing with Sasha and Zoe how to have a single-node install that is more suitable for doing real work. They were talking about setting up SLURM and crunch-dispatch-slurm. I thought that sounded like a lot of complexity for a single-node install, and that the only reason not to use crunch-dispatch-local is the lack of resource management described above. So I spent a couple of hours in the evening putting this branch together to do basic resource management.
Long term, making arvados-dispatch-cloud support allocating multiple containers to a node and using the "loopback" driver would make crunch-dispatch-local redundant. As it stands today, though, that is likely to take several weeks or months to fully implement because it needs to solve the general problem, whereas implementing the feature in crunch-dispatch-local only took a couple of hours.
All that said, Sasha suggested that they might want to use SLURM anyway because they have some additional requirements for the Arvados appliance that were not mentioned in the previous conversation, but that was after I'd already started the branch.
Updated by Peter Amstutz 17 days ago
- Target version changed from Development 2024-12-04 to Development 2025-01-08