Feature #20182 (Closed)
Configurable limit percentage of MaxInstances eligible to run top-level or arvados-cwl-runner containers
Description
Currently, if many workflow runners are submitted at once, they can consume all available capacity; each runner then submits child containers and waits forever for them to run, because no capacity is left.
The limit should apply to any container with:
- the (new) scheduling parameter supervisor: true, or
- a command starting with arvados-cwl-runner

arvados-cwl-runner should set supervisor: true when submitting a supervisor container.
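As a rough illustration of that classification rule (a sketch only; the type and field names here are assumptions, not copied from the Arvados source tree), a scheduler could decide whether a container counts as a supervisor like this:

```go
// Illustrative sketch only; types and field names are assumptions,
// not the actual Arvados source.
package scheduler

import "strings"

type SchedulingParameters struct {
	Supervisor bool
}

type Container struct {
	Command              []string
	SchedulingParameters SchedulingParameters
}

// isSupervisor reports whether a container counts against the
// supervisor limit: it either self-identifies via the scheduling
// parameter, or its command starts with arvados-cwl-runner.
func isSupervisor(ctr Container) bool {
	if ctr.SchedulingParameters.Supervisor {
		return true
	}
	return len(ctr.Command) > 0 && strings.HasPrefix(ctr.Command[0], "arvados-cwl-runner")
}
```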
Updated by Peter Amstutz almost 2 years ago
- Story points changed from 1.0 to 2.0
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Future to To be scheduled
Updated by Tom Clegg almost 2 years ago
- Subject changed from Configurable limit percentage of MaxInstances eligible to run top-level containers to Configurable limit percentage of MaxInstances eligible to run top-level or arvados-cwl-runner containers
Updated by Peter Amstutz almost 2 years ago
From discussion:
In the particular use case we have, they are actually starting a bunch of workflows from a "launcher", so the workflows that we want to throttle are not actually top-level processes. So we need to model the throttle a little differently.
Suggestion: instead of counting top-level processes, we add a scheduling parameter "supervisor: true" and it counts all the processes which self-identify as "supervisor" processes.
Updated by Peter Amstutz almost 2 years ago
Honestly, it would probably be a good idea to also special-case containers whose command starts with arvados-cwl-runner, so we don't have to force everyone to immediately update a-c-r for this feature to work.
Updated by Peter Amstutz almost 2 years ago
- Target version changed from To be scheduled to 2023-03-01 sprint
- Assigned To set to Peter Amstutz
- Status changed from New to In Progress
Updated by Peter Amstutz almost 2 years ago
20182-supervisor-limit @ 4447b5a61f79edf2411ba94f4ad5d90e1ca7e220
Adds a new configuration parameter, Containers.SupervisorFraction. This is multiplied by MaxInstances to determine how many "supervisor" containers are eligible to run. In Scheduler.runQueue(), while going through the containers sorted by priority, it counts the supervisor containers and takes no action on additional supervisor containers once the limit is reached.
For the default fraction, I picked 0.3 (30%), roughly a 2:1 ratio of useful workers to supervisors.
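A minimal sketch of that logic, reusing the Container type and isSupervisor helper from the sketch in the description above (names and structure are assumptions; the real Scheduler.runQueue() in lib/dispatchcloud differs in detail):

```go
package scheduler

// supervisorLimit computes how many supervisor containers may run,
// as SupervisorFraction multiplied by MaxInstances (e.g. with the
// default fraction 0.3 and MaxInstances 64, the limit is int(19.2) == 19).
func supervisorLimit(maxInstances int, fraction float64) int {
	return int(fraction * float64(maxInstances))
}

// runQueueSketch walks the queue in priority order and skips
// supervisor containers once the limit is reached; other containers
// are unaffected.
func runQueueSketch(sorted []Container, maxInstances int, fraction float64) {
	limit := supervisorLimit(maxInstances, fraction)
	supervisors := 0
	for _, ctr := range sorted {
		if isSupervisor(ctr) {
			supervisors++
			// A limit of 0 means no limit is enforced.
			if limit > 0 && supervisors > limit {
				continue // over the limit: take no action on this container
			}
		}
		// ... start the container, create a new instance, etc. ...
	}
}
```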
Updated by Peter Amstutz almost 2 years ago
Not sure if the failures are flaky tests. Re-run: developer-run-tests: #3519
Updated by Peter Amstutz almost 2 years ago
There was a real bug, let's see if I fixed it:
20182-supervisor-limit @ 53a0cb93d935f02248f4e347ea0689fccf5e818e
Updated by Peter Amstutz almost 2 years ago
Re-running workbench integration test: developer-run-tests-apps-workbench-integration: #3786
Updated by Tom Clegg almost 2 years ago
Scheduler change looks good.
Should copy the "supervisor" flag (true if any are true?) to the new container when retrying in Container#handle_completed in source:services/api/app/models/container.rb
May be less confusing if we automatically set supervisor=true for a-c-r containers in source:services/api/app/models/container_request.rb instead of looking for the special case in the scheduler itself. (Or, if we do stick with detecting it in the scheduler, we need to add command to the selected fields in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go.)
When ratio > 0.0 but int(ratio * maxInstances) == 0, we should probably set the actual limit to 1, not 0 = unlimited (see the sketch after this review).
Config knob should probably be under CloudVMs since it only applies to the cloud scheduler and is a ratio of CloudVMs.MaxInstances
Typo: "container who's purpose is" should be "container whose purpose is".
Number of "supervisor" containers eligible to run at any given time expressed as a fraction of CloudVMs.MaxInstances
How about:
Maximum fraction of CloudVMs.MaxInstances allowed to run "supervisor" containers at any given time
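A sketch of the rounding clamp suggested above (again an assumption, not the production code): if the fraction is positive but the product truncates to zero, the limit becomes 1 instead of 0, since 0 would otherwise mean "unlimited".

```go
package scheduler

// supervisorLimitClamped is like supervisorLimit, but guarantees that
// a positive fraction always yields a limit of at least 1. For
// example, MaxInstances=2 with fraction 0.3 gives int(0.6) == 0,
// which is clamped to 1 rather than being treated as "unlimited".
func supervisorLimitClamped(maxInstances int, fraction float64) int {
	limit := int(fraction * float64(maxInstances))
	if fraction > 0 && limit == 0 {
		return 1
	}
	return limit
}
```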
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-03-01 sprint to Development 2023-03-15 sprint
Updated by Peter Amstutz almost 2 years ago
Tom Clegg wrote in #note-16:
Scheduler change looks good.
Should copy the "supervisor" flag (true if any are true?) to the new container when retrying in Container#handle_completed in source:services/api/app/models/container.rb
Good catch.
May be less confusing if we automatically set supervisor=true for a-c-r containers in source:services/api/app/models/container_request.rb instead of looking for the special case in the scheduler itself. (Or, if we do stick with detecting it in the scheduler, we need to add command to the selected fields in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go.)
I didn't know that the command field wasn't being loaded, so that makes sense. I made the change.
When ratio > 0.0 but int(ratio * maxInstances) == 0, we should probably set the actual limit to 1, not 0 = unlimited.
Fixed.
Config knob should probably be under CloudVMs since it only applies to the cloud scheduler and is a ratio of CloudVMs.MaxInstances
Fixed.
Typo: "container who's purpose is" should be "container whose purpose is".
Fixed.
Number of "supervisor" containers eligible to run at any given time expressed as a fraction of CloudVMs.MaxInstances
How about:
Maximum fraction of CloudVMs.MaxInstances allowed to run "supervisor" containers at any given time
Fixed.
20182-supervisor-limit @ 4eb591b3d541cc5ac035fb0eed7c39dc8e81dd02
Updated by Peter Amstutz almost 2 years ago
Re-run of the workbench benchmark tests: developer-run-tests-apps-workbench_benchmark: #3538
Updated by Peter Amstutz almost 2 years ago
- Status changed from In Progress to Resolved