Feature #14922

Run multiple containers concurrently on a single cloud VM

Added by Tom Clegg about 5 years ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
Future
Story points:
5.0

Description

Run a new container on an already-occupied VM (instead of using an idle VM or creating a new one) when the following conditions apply:
  • the occupied VM is the same price as the instance type that would normally be chosen for the new container
  • the occupied VM has enough unallocated RAM, scratch, and VCPUs to accommodate the new container
  • either all containers on the VM allocate >0 VCPUs, or the instance will still have non-zero unallocated space after adding the new container (this ensures that an N-VCPU container does not share an N-VCPU instance with anything, even 0-VCPU containers)
  • the occupied VM has IdleBehavior=run (not hold or drain)

If multiple occupied VMs satisfy these criteria, choose the one that has the most containers already running, or the most RAM already occupied. This will tend to drain the pool of shared VMs rather than keeping many underutilized VMs alive after a busy period subsides to a less-busy period.
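For illustration, here is a minimal Go sketch of the selection rule described above. The types and field names (ctrRequest, vmState, chooseOccupiedVM) are hypothetical, not the actual arvados-dispatch-cloud data structures.

```go
package scheduler

// Hypothetical container request and VM bookkeeping; not the real
// arvados-dispatch-cloud types.
type ctrRequest struct {
	VCPUs   int
	RAM     int64   // bytes
	Scratch int64   // bytes
	Price   float64 // price of the instance type that would normally be chosen
}

type vmState struct {
	Price        float64
	IdleBehavior string // "run", "hold", or "drain"
	TotalVCPUs   int
	TotalRAM     int64
	TotalScratch int64
	AllocVCPUs   int
	AllocRAM     int64
	AllocScratch int64
	Containers   int  // containers already running on this VM
	HasZeroVCPU  bool // true if any running container allocates 0 VCPUs
}

// chooseOccupiedVM returns the index of the best already-occupied VM for ctr,
// or -1 if none qualifies (in which case an idle or new instance is used).
func chooseOccupiedVM(vms []vmState, ctr ctrRequest) int {
	best := -1
	for i, vm := range vms {
		if vm.IdleBehavior != "run" || vm.Price != ctr.Price {
			continue
		}
		freeVCPUs := vm.TotalVCPUs - vm.AllocVCPUs
		if ctr.VCPUs > freeVCPUs ||
			ctr.RAM > vm.TotalRAM-vm.AllocRAM ||
			ctr.Scratch > vm.TotalScratch-vm.AllocScratch {
			continue
		}
		// Either every container (including the new one) allocates >0 VCPUs,
		// or the VM must still have unallocated VCPUs after adding ctr.
		if (vm.HasZeroVCPU || ctr.VCPUs == 0) && freeVCPUs-ctr.VCPUs == 0 {
			continue
		}
		// Prefer the VM with the most containers, then the most RAM in use,
		// so the pool of shared VMs drains as load decreases.
		if best < 0 || vm.Containers > vms[best].Containers ||
			(vm.Containers == vms[best].Containers && vm.AllocRAM > vms[best].AllocRAM) {
			best = i
		}
	}
	return best
}
```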

Typically, "same price as a dedicated node, but has spare capacity" will only happen with the cheapest instance type, but it might also apply to larger sizes if the menu has big size steps. Either way, this rule avoids the risk of wasting money by scheduling small long-running containers onto big nodes. In future, this rule may be configurable (to accommodate workloads that benefit more by sharing than they lose by underusing nodes).

Typically, the smallest node type has 1 VCPU, so this feature is useful only if container requests and containers can either
  • specify a minimum of zero CPUs, or
  • specify a fractional number of CPUs.

Ensure that at least one of those is possible.

Notes

This should work for the case we need it for, because the minimum node size is something like 2 cores / 2-4 GiB of RAM, and the containers we want to run should (ideally) be requesting less than 1 GiB.

We should have a cluster-wide config knob to turn this off.


Related issues

  • Related to Arvados - Feature #15370: [arvados-dispatch-cloud] loopback driver (Resolved, Tom Clegg, 05/17/2022)
  • Related to Arvados - Idea #20473: Automated scalability regression test (New)
  • Related to Arvados - Bug #20801: Crunch discountConfiguredRAMPercent math seems surprising, undesirable (New)
  • Related to Arvados Epics - Idea #20599: Scaling to 1000s of concurrent containers (Resolved, 06/01/2023 - 03/31/2024)
Actions #1

Updated by Tom Clegg about 5 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Morris about 5 years ago

  • Story points set to 2.0
Actions #3

Updated by Tom Morris about 5 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
Actions #4

Updated by Tom Clegg almost 5 years ago

  • Related to Feature #15370: [arvados-dispatch-cloud] loopback driver added
Actions #5

Updated by Tom Clegg over 4 years ago

  • Subject changed from [crunch-dispatch-cloud] Run multiple containers on a single VM to [crunch-dispatch-cloud] Run multiple containers concurrently on a single VM
Actions #6

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #7

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
Actions #8

Updated by Brett Smith 12 months ago

  • Related to Idea #20473: Automated scalability regression test added
Actions #9

Updated by Peter Amstutz 9 months ago

  • Story points deleted (2.0)
  • Target version set to Future
  • Description updated (diff)
  • Subject changed from [crunch-dispatch-cloud] Run multiple containers concurrently on a single VM to Run multiple containers concurrently on a single cloud VM
Actions #10

Updated by Peter Amstutz 9 months ago

  • Release deleted (60)
Actions #11

Updated by Peter Amstutz 8 months ago

A few additional thoughts

For me, the hard part of this story isn't so much packing containers onto instances as deciding when it is appropriate to start up a larger instance in order to have capacity to schedule multiple containers. This isn't HPC, where we have a static set of nodes to work with.

One approach is to look ahead in the queue and pick out containers that fit together nicely, then schedule them together.

However, during the start-up phase of a batch, workflows are being submitted every 6-12 seconds. At this point, it's possible we may not have a substantial backlog of containers that would allow us to look at the queue and pack 4 or 8 jobs onto an instance.

So another strategy would be that, when a small container doesn't have an instance, we prospectively start a somewhat larger instance, e.g. 4 or 8 cores, and then additional containers in the queue get scheduled onto that instance.

Right now the cloud dispatcher has a pretty strong model of 1:1 containers to instances. Perhaps we could use a strategy where we subdivide an instance into 4 or 8 fractional instances, and schedule 1:1 on these fractional instances.

This would mean that requesting a fractional instance would start up a full-size instance but then add 4-8 fractional instances to the pool.

Scheduling to a fractional instance would mean finding an instance with the least spare capacity greater than 0, then running the container there (this is to discourage situations where you have a bunch of full-size instances that are each running one container).

When all the fractional slots of a full-size instance are idle, the full-size instance can be shut down.
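As a rough illustration of the fractional-instance idea (not existing dispatcher code; the fullInstance type and placeFractional function are made up), the placement rule might look like:

```go
package scheduler

// fullInstance is a hypothetical record of one full-size cloud instance that
// has been subdivided into fractional slots (e.g. 4 or 8).
type fullInstance struct {
	ID        string
	Slots     int // number of fractional slots on this instance
	UsedSlots int // slots currently occupied by containers
}

// placeFractional assigns a container to the instance with the fewest free
// slots (but more than zero), so partially used instances fill up before new
// full-size instances are booted. A nil return means no capacity exists and a
// new full-size instance (contributing Slots fractional instances to the
// pool) must be started.
func placeFractional(instances []*fullInstance) *fullInstance {
	var best *fullInstance
	for _, inst := range instances {
		free := inst.Slots - inst.UsedSlots
		if free <= 0 {
			continue
		}
		if best == nil || free < best.Slots-best.UsedSlots {
			best = inst
		}
	}
	if best != nil {
		best.UsedSlots++
	}
	return best
}

// When UsedSlots returns to zero, the full-size instance can be shut down.
```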

Actions #12

Updated by Peter Amstutz 8 months ago

  • Description updated (diff)

Previous discussion (PA)

We have a customer that has limitations on the number of instances they can run, but not on the size of those instances -- this seems to be due to network management policy: the instances are launched in a subnet with a limited number of IP addresses.

As a result, allocating larger VMs and running multiple containers per VM, even when there is minimal cost savings, makes sense as a scaling strategy.

Given a queue of pending containers, we would like an algorithm that instead of considering a single container at a time, looks ahead in the queue and finds opportunities to boot a larger instance that can accommodate multiple containers.

The return on investment will come from optimizing supervisor containers (e.g. arvados-cwl-runner), which are generally scheduled with relatively lightweight resource requirements and spend a lot of time waiting.

So, a simplifying assumption to avoid solving the general problem could be to focus on small container optimization and only try to co-schedule containers that (for example) request 1 core and less than 2 GiB of RAM.
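A sketch of that eligibility test, with the thresholds hard-coded purely for illustration (in practice they would presumably come from cluster configuration):

```go
package scheduler

// eligibleForCoScheduling reports whether a container is "small" enough to be
// considered for co-scheduling under the simplified rule above: at most 1
// core and less than 2 GiB of RAM. Threshold values are illustrative only.
func eligibleForCoScheduling(vcpus int, ramBytes int64) bool {
	const maxVCPUs = 1
	const maxRAM = 2 << 30 // 2 GiB in bytes
	return vcpus <= maxVCPUs && ramBytes < maxRAM
}
```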

Actions #13

Updated by Peter Amstutz 8 months ago

  • Description updated (diff)
Actions #14

Updated by Peter Amstutz 8 months ago

  • Related to Bug #20801: Crunch discountConfiguredRAMPercent math seems surprising, undesirable added
Actions #15

Updated by Peter Amstutz 8 months ago

  • Story points set to 5.0
  • Target version changed from Future to To be scheduled
Actions #16

Updated by Peter Amstutz 7 months ago

  • Related to Idea #20599: Scaling to 1000s of concurrent containers added
Actions #17

Updated by Tom Clegg 7 months ago

  • Description updated (diff)
Actions #18

Updated by Peter Amstutz 4 months ago

Another thought:

We have the notion of "supervisor" containers.

We could have a configured instance type specifically for "supervisor" containers, along with a maximum number of containers we're willing to run on it -- so we could take a 4 GiB instance and specify that we want to schedule 6 or 8 workflow runners on it, on the assumption that they won't all use their full RAM or CPU quota.
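A minimal sketch of what that configuration might carry (these are not real Arvados config keys, just an illustration of the idea):

```go
package scheduler

// supervisorPacking is a hypothetical cluster-level setting describing how
// supervisor containers (e.g. arvados-cwl-runner) could be packed onto a
// dedicated instance type.
type supervisorPacking struct {
	InstanceType       string // name of the instance type reserved for supervisors
	MaxContainersPerVM int    // e.g. 6 or 8 workflow runners per VM
}

// A dispatcher using this setting would place a new supervisor container on
// an existing supervisor VM whenever that VM is running fewer than
// MaxContainersPerVM containers, regardless of nominal RAM/CPU requests.
```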

Actions #19

Updated by Peter Amstutz about 2 months ago

  • Target version changed from To be scheduled to Future