Project

General

Profile

Actions

Feature #20182

closed

Configurable limit percentage of MaxInstances eligible to run top-level or arvados-cwl-runner containers

Added by Tom Clegg almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Story points:
2.0
Release relationship:
Auto

Description

Currently, if many workflow runners are submitted at once, they might use up all available capacity, and they will all submit child containers and wait forever for them to run, because all available capacity is consumed.

The limit should apply to any container with
  • the (new) scheduling parameter supervisor: true, or
  • command starting with arvados-cwl-runner

Arvados-cwl-runner should set supervisor: true when submitting a supervisor container.


Subtasks 1 (0 open1 closed)

Task #20186: Review 20182-supervisor-limitResolvedPeter Amstutz03/02/2023Actions
Actions #1

Updated by Tom Clegg almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 2 years ago

  • Story points set to 1.0
Actions #3

Updated by Peter Amstutz almost 2 years ago

  • Story points changed from 1.0 to 2.0
Actions #4

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from Future to To be scheduled
Actions #5

Updated by Tom Clegg almost 2 years ago

  • Subject changed from Configurable limit percentage of MaxInstances eligible to run top-level containers to Configurable limit percentage of MaxInstances eligible to run top-level or arvados-cwl-runner containers
Actions #6

Updated by Peter Amstutz almost 2 years ago

From discussion:

in the particular user case we have, they are actually starting a bunch of workflows from a "launcher" so the workflows that we want to throttle are not actually top-level processes. So we need to model the throttle a little bit differently.

Suggestion: instead of counting top-level processes, we add a scheduling parameter "supervisor: true" and it counts all the processes which self-identify as "supervisor" processes.

Actions #7

Updated by Tom Clegg almost 2 years ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz almost 2 years ago

Honestly it would probably be a good idea to also have a special case of containers where the command starts with arvados-cwl-runner so we don't have to also force everyone to immediately update a-c-r for this feature to work.

Actions #9

Updated by Tom Clegg almost 2 years ago

  • Description updated (diff)
Actions #10

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from To be scheduled to 2023-03-01 sprint
  • Assigned To set to Peter Amstutz
  • Status changed from New to In Progress
Actions #11

Updated by Peter Amstutz almost 2 years ago

20182-supervisor-limit @ 4447b5a61f79edf2411ba94f4ad5d90e1ca7e220

Adds a new configuration parameter "Containers.SupervisorFraction". This is multiplied with MaxInstances to determine how many "supervisor" containers are eligible to run. In Scheduler.runQueue(), when it is going through the containers sorted by priority, it counts up the number of supervisor containers and takes no action on additional supervisor containers after reaching the limit.

For the default fraction, I picked 0.3 (30%) as roughly a ~2:1 ratio of useful workers to supervisors.

developer-run-tests: #3518

Actions #13

Updated by Peter Amstutz almost 2 years ago

not sure if the failures are flaky tests. re-run: developer-run-tests: #3519

Actions #14

Updated by Peter Amstutz almost 2 years ago

20182-supervisor-limit @ 53a0cb93d935f02248f4e347ea0689fccf5e818e

There was a real bug, let's see if I fixed it:

developer-run-tests: #3520

Actions #15

Updated by Peter Amstutz almost 2 years ago

Re-running workbench integration test: developer-run-tests-apps-workbench-integration: #3786

Actions #16

Updated by Tom Clegg almost 2 years ago

Scheduler change looks good.

Should copy the "supervisor" flag (true if any are true?) to the new container when retrying in Container#handle_completed in source:services/api/app/models/container.rb

May be less confusing if we automatically set supervisor=true for a-c-r containers in source:services/api/app/models/container_request.rb instead of looking for the special case in the scheduler itself. (Or, if we do stick with detecting it in the scheduler, we need to add command to the selected fields in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go)

When ratio > 0.0 but int(ratio * maxInstances) == 0 we should probably set the actual limit to 1, not 0=unlimited.

Config knob should probably be under CloudVMs since it only applies to the cloud scheduler and is a ratio of CloudVMs.MaxInstances

container who's purpose is

whose

Number of "supervisor" containers eligible to run at any given time expressed as a fraction of CloudVMs.MaxInstances

How about:

Maximum fraction of CloudVMs.MaxInstances allowed to run "supervisor" containers at any given time

Actions #17

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2023-03-01 sprint to Development 2023-03-15 sprint
Actions #18

Updated by Peter Amstutz almost 2 years ago

Tom Clegg wrote in #note-16:

Scheduler change looks good.

Should copy the "supervisor" flag (true if any are true?) to the new container when retrying in Container#handle_completed in source:services/api/app/models/container.rb

Good catch.

May be less confusing if we automatically set supervisor=true for a-c-r containers in source:services/api/app/models/container_request.rb instead of looking for the special case in the scheduler itself. (Or, if we do stick with detecting it in the scheduler, we need to add command to the selected fields in (*Queue)poll() in source:lib/dispatchcloud/container/queue.go)

I didn't know that the command field wasn't being loaded, so that makes sense. I made the change.

When ratio > 0.0 but int(ratio * maxInstances) == 0 we should probably set the actual limit to 1, not 0=unlimited.

Fixed.

Config knob should probably be under CloudVMs since it only applies to the cloud scheduler and is a ratio of CloudVMs.MaxInstances

Fixed.

container who's purpose is

whose

Fixed.

Number of "supervisor" containers eligible to run at any given time expressed as a fraction of CloudVMs.MaxInstances

How about:

Maximum fraction of CloudVMs.MaxInstances allowed to run "supervisor" containers at any given time

Fixed.

20182-supervisor-limit @ 4eb591b3d541cc5ac035fb0eed7c39dc8e81dd02

developer-run-tests: #3524

Actions #19

Updated by Tom Clegg almost 2 years ago

LGTM, thanks

Actions #20

Updated by Peter Amstutz almost 2 years ago

Re-run of developer-run-tests-apps-workbench_benchmark: developer-run-tests-apps-workbench_benchmark: #3538

Actions #21

Updated by Peter Amstutz almost 2 years ago

  • Status changed from In Progress to Resolved
Actions #22

Updated by Peter Amstutz almost 2 years ago

  • Release set to 57
Actions

Also available in: Atom PDF