Feature #22314

open

Resource accounting in crunch-dispatch-local

Added by Peter Amstutz about 1 month ago. Updated 17 days ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-

Subtasks 1 (1 open, 0 closed)

Task #22345: Review (New, Tom Clegg)

Related issues 1 (1 open, 0 closed)

Related to Arvados - Feature #14922: Run multiple containers concurrently on a single cloud VM (New)
Actions #1

Updated by Peter Amstutz about 1 month ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 1 month ago

  • Assigned To set to Peter Amstutz
  • Status changed from In Progress to New
  • Category set to Crunch
  • Tracker changed from Bug to Feature
Actions #3

Updated by Peter Amstutz about 1 month ago

22314-dispatch-rsc @ 5b0e5a375ce34ea60ffb5543e2c6bae03fc4b126

Needs some tests. It also accounts for GPUs but doesn't explicitly allocate them, which may be necessary: something needs to track which GPU devices are busy and assign an idle one to each container.
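To illustrate the explicit GPU allocation described above, here is a minimal sketch of a device pool that tracks which GPUs are busy and hands out idle ones. It is written in Python for brevity and is not the code in the branch; the class name `GPUPool`, its methods, and the use of integer device indices are all assumptions made for this example.

```python
class GPUPool:
    """Hypothetical tracker for which GPU devices are busy or idle."""

    def __init__(self, device_ids):
        # Device indices known to the host, e.g. [0, 1, 2] (assumed format).
        self.idle = sorted(device_ids)
        # Maps a container UUID to the list of devices assigned to it.
        self.assigned = {}

    def allocate(self, container_uuid, count):
        """Reserve `count` idle devices, or return None if not enough are idle."""
        if container_uuid in self.assigned or count > len(self.idle):
            return None
        devices = self.idle[:count]
        del self.idle[:count]
        self.assigned[container_uuid] = devices
        return devices

    def release(self, container_uuid):
        """Return a finished container's devices to the idle list."""
        self.idle.extend(self.assigned.pop(container_uuid, []))
        self.idle.sort()
```

With a pool like this, the dispatcher would know the concrete devices to expose to each container rather than only a count of GPUs in use. How the chosen devices are passed to the container runtime is left out here.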

Actions #4

Updated by Tom Clegg about 1 month ago

  • Related to Feature #14922: Run multiple containers concurrently on a single cloud VM added
Actions #5

Updated by Peter Amstutz 30 days ago

  • Status changed from New to In Progress
Actions #6

Updated by Tom Clegg 30 days ago

I think it would be helpful to mention the problems or use-cases this is meant to address.

Actions #7

Updated by Peter Amstutz 30 days ago

Tom Clegg wrote in #note-6:

I think it would be helpful to mention the problems or use-cases this is meant to address.

The current implementation has a fixed concurrency of 8 processes, with no logic either to scale that limit to the resources of the host or to account for the resource requests of the containers. This makes it unsuitable for real work.
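As a rough illustration of the difference between a fixed process limit and resource accounting, the sketch below admits a container only when its request fits in what the host has left. It is written in Python for brevity and is not the code in the branch; the names `Resources` and `ResourceAccountant`, and the choice of VCPUs, RAM, and GPU count as the tracked quantities, are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    vcpus: int
    ram: int       # bytes
    gpus: int = 0

    def fits(self, req):
        """True if request `req` can be satisfied from these resources."""
        return (req.vcpus <= self.vcpus
                and req.ram <= self.ram
                and req.gpus <= self.gpus)


class ResourceAccountant:
    """Hypothetical bookkeeping: start a container only if its request fits."""

    def __init__(self, total):
        self.free = Resources(total.vcpus, total.ram, total.gpus)
        self.running = {}  # container UUID -> Resources reserved for it

    def try_start(self, container_uuid, req):
        if container_uuid in self.running or not self.free.fits(req):
            return False  # caller leaves the container queued and retries later
        self.free.vcpus -= req.vcpus
        self.free.ram -= req.ram
        self.free.gpus -= req.gpus
        self.running[container_uuid] = req
        return True

    def finish(self, container_uuid):
        req = self.running.pop(container_uuid, None)
        if req is not None:
            self.free.vcpus += req.vcpus
            self.free.ram += req.ram
            self.free.gpus += req.gpus
```

Under this scheme, the number of concurrent containers follows from the host size and the individual requests instead of being capped at a constant.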

I was discussing with Sasha and Zoe how to have a single node install that is more suitable for doing real work. They were talking about setting up SLURM and crunch-dispatch-slurm, and I thought that sounded like a lot of complexity for a single node install, and that the only reason not to use crunch-dispatch-local is the lack of resource management as described above. So I spent a couple hours in the evening putting this branch together to do basic resource management.

Long term, making arvados-dispatch-cloud support allocating multiple containers to a node and using the "loopback" driver would make crunch-dispatch-local redundant, but as it stands today that's likely to take several weeks/months to fully implement because it needs to solve the general problem, whereas implementing the feature in crunch-dispatch-local only took a couple hours.

All that said, Sasha suggested that they might want to use SLURM anyway because they have some additional requirements for the Arvados appliance that were not mentioned in the previous conversation, but that was after I'd already started the branch.

Actions #8

Updated by Tom Clegg 30 days ago

Peter Amstutz wrote in #note-7:

a single node install that is more suitable for doing real work

That is helpful context, thanks.

Presumably we'll still do #14922 in due course.

(I take it the branch is not ready for review)

Actions #9

Updated by Peter Amstutz 17 days ago

  • Target version changed from Development 2024-12-04 to Development 2025-01-08