Feature #14922
openRun multiple containers concurrently on a single cloud VM
Description
- the occupied VM is the same price as the instance type that would normally be chosen for the new container
- the occupied VM has enough unallocated RAM, scratch, and VCPUs to accommodate the new container
- either all containers on the VM allocate >0 VCPUs, or the instance will still have non-zero unallocated space after adding the new container (this ensures that an N-VCPU container does not share an N-VCPU instance with anything, even 0-VCPU containers)
- the occupied VM has IdleBehavior=run (not hold or drain)
If multiple occupied VMs satisfy these criteria, choose the one that has the most containers already running, or the most RAM already occupied. This will tend to drain the pool of shared VMs rather than keeping many underutilized VMs alive after a busy period subsides to a less-busy period.
Typically, "same price as a dedicated node, but has spare capacity" will only happen with the cheapest instance type, but it might also apply to larger sizes if the menu has big size steps. Either way, this rule avoids the risk of wasting money by scheduling small long-running containers onto big nodes. In future, this rule may be configurable (to accommodate workloads that benefit more by sharing than they lose by underusing nodes).
Typically, the smallest node type has 1 VCPU, so this feature is useful only if container requests and containers can either- specify a minimum of zero CPUs, or
- specify a fractional number of CPUs.
...so ensure at least one of those is possible.
Notes¶
This should work for the case we need it to because the minimum node size is something like 2 cores / 2-4 GiB and the containers we want to run should (ideally) be requesting less than 1 GiB.
We should have a cluster-wide config knob to turn this off.
Updated by Tom Morris almost 6 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
Updated by Tom Clegg over 5 years ago
- Related to Feature #15370: [arvados-dispatch-cloud] loopback driver added
Updated by Tom Clegg over 5 years ago
- Subject changed from [crunch-dispatch-cloud] Run multiple containers on a single VM to [crunch-dispatch-cloud] Run multiple containers concurrently on a single VM
Updated by Peter Amstutz over 3 years ago
- Target version deleted (
Arvados Future Sprints)
Updated by Brett Smith over 1 year ago
- Related to Idea #20473: Automated scalability regression test added
Updated by Peter Amstutz over 1 year ago
- Story points deleted (
2.0) - Target version set to Future
- Description updated (diff)
- Subject changed from [crunch-dispatch-cloud] Run multiple containers concurrently on a single VM to Run multiple containers concurrently on a single cloud VM
Updated by Peter Amstutz over 1 year ago
A few additional thoughts
For me, the hard part of this story isn't so much packing containers onto instances as deciding when it is appropriate to start up a larger instance in order to have capacity to schedule multiple containers. This isn't HPC where we have a static set of nodes we work with.
One approach is to look ahead in the queue and pick out containers that fit together nicely, then schedule them together.
However, during the start-up phase of a batch, workflows are being submitted every 6-12 seconds. At this point, it's possible we may not have a substantial backlog of containers that would allow us to look at the queue and pack 4 or 8 jobs to an instance.
So another strategy would be that, when a small container doesn't have an instance, we prospectively start somewhat larger instance, e.g. 4 or 8 cores, and then additional containers in the queue get scheduled onto the instance.
Right now the cloud dispatcher has a pretty strong model of 1:1 containers to instances. Perhaps we could use a strategy where we subdivide an instance into 4 or 8 fractional instances, and schedule 1:1 on these fractional instances.
This would mean requesting a fractional instance would start up the full size instance but then add 4-8 fractional instances to the pool.
Scheduling to a fractional instance would mean finding an instance with the least spare capacity greater than 0, the running it there (this is to discourage situations where you have a bunch of full size instances that are each running 1 container).
When all the fractional slots of a full size instance are idle, the full size instance can be shut down.
Updated by Peter Amstutz over 1 year ago
- Description updated (diff)
Previous discussion (PA)
We have a customer which has limitations on the number of instances they can run, but not on the size of those instances -- this seems to be due to network management policy, the instances are launched in a subnet with a limited number of IP addresses.
As a result, allocating larger VMs and running multiple containers per VM, even when there is minimal cost savings, makes sense as a scaling strategy.
Given a queue of pending containers, we would like an algorithm that instead of considering a single container at a time, looks ahead in the queue and finds opportunities to boot a larger instance that can accommodate multiple containers.
The return on investment will be to optimize supervisor containers (eg arvados-cwl-runner) that are generally scheduled with relatively lightweight resource requirements and spend a lot of time waiting.
So, a simplifying assumption to avoid solving the general problem could be to focus on small container optimization and only try to co-schedule containers that (for example) request 1 core and less than 2 GiB of RAM.
Updated by Peter Amstutz over 1 year ago
- Related to Bug #20801: Crunch discountConfiguredRAMPercent math seems surprising, undesirable added
Updated by Peter Amstutz over 1 year ago
- Story points set to 5.0
- Target version changed from Future to To be scheduled
Updated by Peter Amstutz over 1 year ago
- Related to Idea #20599: Scaling to 1000s of concurrent containers added
Updated by Peter Amstutz about 1 year ago
Another thought:
We have the notion of "supervisor" containers.
We could have a configured instance type specifically for "supervisor" containers, along with a maximum number of containers we're willing to run -- so we could take a 4 gigabyte instance and specify that we want to schedule 6 or 8 workflow runners on it on the assumption that they won't all use their full RAM or CPU quota.
Updated by Peter Amstutz 11 months ago
- Target version changed from To be scheduled to Future
Updated by Tom Clegg 2 months ago
- Related to Feature #22314: Resource accounting in crunch-dispatch-local added