Feature #16316

open

a-c-r handles resource range requests (especially CPU) and adjusts requests based on what is in InstanceTypes list

Added by Peter Amstutz about 4 years ago. Updated 4 months ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
CWL
Target version:
Story points:
2.0

Description

Implement a version of select_resources for Arvados.

You can get a dictionary of instance types with this:

api.config()["InstanceTypes"]

The select_resources method should, at minimum, accept a range of CPU core values (e.g. coresMin: 4, coresMax: 16) and then check the available InstanceTypes and assign the greatest core count available. For example, if the system is only configured with 2, 4, and 8 core nodes, it should assign 8 cores since it is in the range (4 - 16).

RAM and disk can also have a range. Just return the minimum value for now (this is the existing behavior).

Tell cwltool to use your select_resources method by setting the object field runtimeContext.select_resources.
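
A minimal sketch of the behavior described above, not a definitive implementation. It assumes each entry in the InstanceTypes dictionary exposes its core count under a "VCPUs" key (an assumption about the config layout that this ticket does not confirm). The instance types and request values are made up for illustration, and binding the dictionary with functools.partial is just one possible way to give the hook access to it. The returned keys mirror the default dictionary that cwltool builds in evalResources (quoted in note 37).

```python
import functools
import math


def select_resources(request, runtime_context, instance_types):
    """Pick the largest configured core count inside [coresMin, coresMax]."""
    # Assumption: each instance type is a dict with an integer "VCPUs" entry.
    available = {it["VCPUs"] for it in instance_types.values()}
    in_range = [c for c in available
                if request["coresMin"] <= c <= request["coresMax"]]
    # Fall back to the minimum when no configured type fits the range.
    cores = max(in_range) if in_range else request["coresMin"]
    return {
        "cores": cores,
        # RAM and disk keep the existing behavior: return the minimum.
        "ram": math.ceil(request["ramMin"]),
        "tmpdirSize": math.ceil(request["tmpdirMin"]),
        "outdirSize": math.ceil(request["outdirMin"]),
    }


# Made-up instance types standing in for api.config()["InstanceTypes"].
example_types = {
    "small": {"VCPUs": 2},
    "medium": {"VCPUs": 4},
    "large": {"VCPUs": 8},
}
example_request = {
    "coresMin": 4, "coresMax": 16,
    "ramMin": 1024, "tmpdirMin": 1024, "outdirMin": 1024,
}

# Bind the instance types so the hook takes (request, runtime_context).
hook = functools.partial(select_resources, instance_types=example_types)
result = hook(example_request, None)
print(result["cores"])  # prints 8
```

In a-c-r, the bound hook would then be assigned to runtimeContext.select_resources, as the description says.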


Subtasks 1 (1 open, 0 closed)

Task #16354: Review (New, Peter Amstutz)

Related issues

Related to Arvados Epics - Idea #20273: More CWL runner improvements (New)
Related to Arvados Epics - Idea #18179: Better spot instance support (In Progress, 03/01/2022 to 03/31/2024)
Related to Arvados - Feature #20978: Support multiple candidate instance types to assign containers (Resolved, Tom Clegg, 10/31/2023)
Actions #1

Updated by Peter Amstutz almost 4 years ago

  • Assigned To set to Peter Amstutz
Actions #2

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2020-05-06 Sprint to 2020-05-20 Sprint
Actions #3

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2020-05-20 Sprint to 2020-06-03 Sprint
Actions #4

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2020-06-03 Sprint to 2020-06-17 Sprint
Actions #5

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2020-06-17 Sprint to 2020-07-01 Sprint
Actions #6

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2020-07-01 Sprint to 2020-07-15
Actions #7

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2020-07-15 to 2020-08-12 Sprint
Actions #8

Updated by Peter Amstutz almost 4 years ago

  • Related to Idea #16011: CWL support, docs, training, website added
Actions #9

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2020-08-12 Sprint to 2020-08-26 Sprint
Actions #10

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2020-08-26 Sprint to 2020-09-09 Sprint
Actions #11

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2020-09-09 Sprint to 2020-09-23 Sprint
Actions #12

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2020-09-23 Sprint to 2020-10-07 Sprint
Actions #13

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2020-10-07 Sprint to 2020-10-21 Sprint
Actions #14

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2020-10-21 Sprint to 2020-11-04 Sprint
Actions #15

Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2020-11-04 Sprint to 2020-11-18
Actions #16

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (2020-11-18)
Actions #17

Updated by Peter Amstutz about 3 years ago

  • Target version set to 2021-03-31 sprint
Actions #18

Updated by Peter Amstutz about 3 years ago

  • Assigned To changed from Peter Amstutz to Jiayong Li
Actions #19

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-03-31 sprint to 2021-04-14 sprint
Actions #20

Updated by Peter Amstutz about 3 years ago

  • Target version changed from 2021-04-14 sprint to 2021-04-28 bughunt sprint
Actions #21

Updated by Peter Amstutz about 3 years ago

  • Target version deleted (2021-04-28 bughunt sprint)
Actions #22

Updated by Peter Amstutz almost 3 years ago

  • Related to Idea #17848: CWL runner improvements added
Actions #23

Updated by Peter Amstutz almost 3 years ago

  • Related to deleted (Idea #16011: CWL support, docs, training, website)
Actions #24

Updated by Peter Amstutz over 1 year ago

  • Target version set to 2022-11-09 sprint
Actions #25

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #26

Updated by Jiayong Li over 1 year ago

  • Status changed from New to In Progress
  • Description updated (diff)
Actions #27

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Actions #28

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Actions #29

Updated by Jiayong Li over 1 year ago

I'm trying to run the unit test "test_resource_requirements" using python3 virtualenv on tordo shell node. I got the following error.

$ python setup.py install

Installed /home/jli/env/acr/lib/python3.7/site-packages/arvados_cwl_runner-2.5.0.dev20221129154757-py3.7.egg
Processing dependencies for arvados-cwl-runner==2.5.0.dev20221129154757
error: pyparsing 3.0.9 is installed but pyparsing<3 is required by {'arvados-python-client'}
$ python setup.py test --test-suite=test.test_container.test_resource_requirements

Using /home/jli/git/arvados/sdk/cwl/.eggs/googleapis_common_protos-1.57.0-py3.7.egg
Traceback (most recent call last):
  File "setup.py", line 60, in <module>
    zip_safe=True,
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 216, in run
    installed_dists = self.install_dists(self.distribution)
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 207, in install_dists
    ir_d = dist.fetch_build_eggs(dist.install_requires)
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/dist.py", line 724, in fetch_build_eggs
    replace_conflicting=True,
  File "/home/jli/env/acr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (google-auth 1.35.0 (/home/jli/git/arvados/sdk/cwl/.eggs/google_auth-1.35.0-py3.7.egg), Requirement.parse('google-auth<3.0dev,>=2.14.1'), {'google-api-core'})

1. Are these the right commands for unit testing?
2. Is pyparsing 3.0.9 the main issue here?

Actions #30

Updated by Peter Amstutz over 1 year ago

Try merging main; I think that fixed this.

Actions #31

Updated by Jiayong Li over 1 year ago

After trying

$ pip install -e .

I get

$ python setup.py test --test-suite=test.test_container.test_resource_requirements

Installed /home/jli/git/arvados/sdk/cwl/.eggs/mock-3.0.5-py3.7.egg
running egg_info
writing arvados_cwl_runner.egg-info/PKG-INFO
writing dependency_links to arvados_cwl_runner.egg-info/dependency_links.txt
writing entry points to arvados_cwl_runner.egg-info/entry_points.txt
writing requirements to arvados_cwl_runner.egg-info/requires.txt
writing top-level names to arvados_cwl_runner.egg-info/top_level.txt
reading manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
running build_ext
Traceback (most recent call last):
  File "setup.py", line 60, in <module>
    zip_safe=True,
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 227, in run
    with self.project_on_sys_path():
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 166, in project_on_sys_path
    require('%s==%s' % (ei_cmd.egg_name, ei_cmd.egg_version))
  File "/home/jli/env/acr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/jli/env/acr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (google-auth 2.15.0 (/home/jli/git/arvados/sdk/cwl/.eggs/google_auth-2.15.0-py3.7.egg), Requirement.parse('google-auth<2'), {'arvados-python-client'})

Actions #32

Updated by Jiayong Li over 1 year ago

Ran "pip install -e ."
under both arvados/sdk/cwl and arvados/sdk/python

Now testing works but the test failed for some reason.

$ python setup.py test --test-suite=test.test_container.test_resource_requirements

test_container (unittest.loader._FailedTest) ... ERROR

======================================================================
ERROR: test_container (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_container
Traceback (most recent call last):
  File "/usr/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
ModuleNotFoundError: No module named 'test.test_container'

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)
Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>

Actions #33

Updated by Jiayong Li over 1 year ago

This command works for the unit test.

$ python setup.py test --test-suite=tests.test_container.TestContainer.test_resource_requirements
Using /home/jli/git/arvados/sdk/python for version number calculation of /home/jli/git/arvados/sdk/cwl
running test
Searching for subprocess32>=3.5.1
Best match: subprocess32 3.5.4
Processing subprocess32-3.5.4-py3.7.egg

Using /home/jli/git/arvados/sdk/cwl/.eggs/subprocess32-3.5.4-py3.7.egg
Searching for mock<4,>=1.0
Best match: mock 3.0.5
Processing mock-3.0.5-py3.7.egg

Using /home/jli/git/arvados/sdk/cwl/.eggs/mock-3.0.5-py3.7.egg
running egg_info
writing arvados_cwl_runner.egg-info/PKG-INFO
writing dependency_links to arvados_cwl_runner.egg-info/dependency_links.txt
writing entry points to arvados_cwl_runner.egg-info/entry_points.txt
writing requirements to arvados_cwl_runner.egg-info/requires.txt
writing top-level names to arvados_cwl_runner.egg-info/top_level.txt
reading manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
running build_ext
test_resource_requirements (tests.test_container.TestContainer) ... ok

----------------------------------------------------------------------
Ran 1 test in 2.149s

OK

Actions #34

Updated by Jiayong Li over 1 year ago

  • Target version changed from 2022-12-07 Sprint to 2022-12-21 Sprint

In the cwltool code, I'm trying to understand select_resources in action (method of class MultithreadedJobExecutor in executors.py), and I've found a mention of select_resources in process.py.

        if runtimeContext.select_resources is not None:
            # Call select resources hook
            return runtimeContext.select_resources(request_evaluated, runtimeContext)

What I find confusing is the following:
1. runtimeContext is an object of class RuntimeContext (from context.py), and select_resources is a method defined for the MultithreadedJobExecutor class (from executors.py). I don't see how we can apply select_resources on runtimeContext.
2. runtimeContext appears both as the object and as the argument, which is confusing to me. What is this command doing at a higher level?

Actions #35

Updated by Peter Amstutz over 1 year ago

Jiayong Li wrote in #note-34:

In the cwltool code, I'm trying to understand select_resources in action (method of class MultithreadedJobExecutor in executors.py), and I've found a mention of select_resources in process.py.
[...]

What I find confusing is the following:
1. runtimeContext is an object of class RuntimeContext (from context.py), and select_resources is a method defined for the MultithreadedJobExecutor class (from executors.py). I don't see how we can apply select_resources on runtimeContext.
2. runtimeContext appears both as the object and as the argument, which is confusing to me. What is this command doing at a higher level?

  1. select_resources is a field on RuntimeContext of type "Callable", which means it is a variable that holds something callable as a function. It isn't a method.
    1. MultithreadedJobExecutor.select_resources is a method that can be assigned to RuntimeContext.select_resources.
    2. In Python, you can assign "callable = object.method", and then invoking "callable()" has the same behavior as calling "object.method()".
    3. For this task, we want to provide our own select_resources function or method. The way to make it available at the right place in cwltool is to assign it to the select_resources field on RuntimeContext.
  2. It references select_resources as a variable whose value is a function, then calls that function. It passes the runtimeContext object to the function because the function would not necessarily have it otherwise.
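
To illustrate the two points above, here is a small self-contained sketch. Context, Executor, and eval_resources are hypothetical stand-ins written for this example. They are not the cwltool classes, and they only imitate the shape of the hook call quoted in note 34.

```python
class Context:
    """Stand-in for RuntimeContext: select_resources is a plain field."""

    def __init__(self):
        self.select_resources = None  # may later hold any callable


class Executor:
    """Stand-in for an executor that owns a select_resources method."""

    def __init__(self, max_cores):
        self.max_cores = max_cores

    def select_resources(self, request, context):
        # The bound method still sees self, even when called via the field.
        return {"cores": min(request["coresMin"], self.max_cores)}


def eval_resources(request, context):
    # Same pattern as the hook call: look up the field, then call its value.
    if context.select_resources is not None:
        return context.select_resources(request, context)
    return {"cores": request["coresMin"]}


ctx = Context()
before = eval_resources({"coresMin": 4}, ctx)  # no hook set, default path

# Assigning a bound method to the field installs it as the hook.
ctx.select_resources = Executor(max_cores=2).select_resources
after = eval_resources({"coresMin": 4}, ctx)

print(before, after)  # prints {'cores': 4} {'cores': 2}
```

The context is passed explicitly because the callable stored in the field could equally be a plain function with no other way to reach it.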
Actions #36

Updated by Jiayong Li over 1 year ago

The assignment happened in main.py

                temp_executor = MultithreadedJobExecutor()
                runtimeContext.select_resources = temp_executor.select_resources

Actions #37

Updated by Jiayong Li over 1 year ago

1. There is no explicit mention of select_resources from cwltool.executors in a-c-r code, how is it implicitly used?
2. How do I use select_resources from cwltool.executors in arvados_cwl.executor without copy/pasting?

Answer:
Currently we don't use the select_resources hook (runtimeContext.select_resources is None), so cwltool falls back to defaultReq as the requirement.

process.py, evalResources:

        if runtimeContext.select_resources is not None:
            # Call select resources hook
            return runtimeContext.select_resources(request_evaluated, runtimeContext)

        defaultReq = {
            "cores": request_evaluated["coresMin"],
            "ram": math.ceil(request_evaluated["ramMin"]),
            "tmpdirSize": math.ceil(request_evaluated["tmpdirMin"]),
            "outdirSize": math.ceil(request_evaluated["outdirMin"]),
        }

Actions #38

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-12-21 Sprint to 2023-01-18 sprint
Actions #39

Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Actions #40

Updated by Peter Amstutz about 1 year ago

  • Story points set to 2.0
Actions #41

Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2023-02-01 sprint to To be scheduled
  • Assigned To deleted (Jiayong Li)
Actions #42

Updated by Peter Amstutz about 1 year ago

  • Related to Idea #20273: More CWL runner improvements added
Actions #43

Updated by Peter Amstutz about 1 year ago

  • Related to deleted (Idea #17848: CWL runner improvements)
Actions #44

Updated by Peter Amstutz 10 months ago

  • Target version changed from To be scheduled to Development 2023-07-05 sprint
Actions #45

Updated by Peter Amstutz 10 months ago

  • Assigned To set to Alex Coleman
Actions #46

Updated by Peter Amstutz 10 months ago

  • Related to Idea #18179: Better spot instance support added
Actions #47

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Actions #48

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
Actions #49

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-08-02 sprint to Future
Actions #51

Updated by Brett Smith 4 months ago

Now that Crunch is gaining the ability to consider a range of instance types for a container, how are these features going to interact? e.g., in the example in the description, the workflow step says it can use anywhere between 4-16 cores. If select_resources boils this down to a request for 8 cores (the largest available in the range), does that mean Crunch won't consider running the container on 4-core nodes, even though the workflow step would support it?

Actions #52

Updated by Brett Smith 4 months ago

  • Related to Feature #20978: Support multiple candidate instance types to assign containers added
Actions #53

Updated by Peter Amstutz 4 months ago

Brett Smith wrote in #note-51:

Now that Crunch is gaining the ability to consider a range of instance types for a container, how are these features going to interact? e.g., in the example in the description, the workflow step says it can use anywhere between 4-16 cores. If select_resources boils this down to a request for 8 cores (the largest available in the range), does that mean Crunch won't consider running the container on 4-core nodes, even though the workflow step would support it?

Further complicating things, CWL lets you include runtime node selection parameters in the command line or environment, e.g. setting a --threads parameter equal to the number of CPU cores allocated. So pushing node selection over to the dispatcher (which is in the position to choose a lower spec node when a higher spec node isn't available) isn't ideal either.

This feature could use some more design work.

One thought that occurs to me would be to ask for maximum specs and set a maximum time you're willing to wait for an instance type once it has reached the top of the priority list, and have a new container failure state "unsatisfiable" that would tell a-c-r it could retry at lower specs.
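
A rough sketch of that retry idea, under loud assumptions: the "unsatisfiable" state and the maximum wait are only the proposal above and do not exist in Arvados, and run_with_fallback, submit, and fake_submit are hypothetical names invented for this illustration.

```python
def run_with_fallback(core_candidates, submit, max_wait_seconds):
    """Try each core count from largest to smallest until one is satisfiable."""
    for cores in sorted(core_candidates, reverse=True):
        # submit is a hypothetical callable returning a final state string.
        state = submit(cores, max_wait_seconds)
        if state != "unsatisfiable":
            return cores, state
    # Every candidate timed out waiting for a matching instance.
    return None, "unsatisfiable"


def fake_submit(cores, max_wait_seconds):
    # Stand-in for container submission: pretend only nodes with
    # 4 or fewer cores become available within the wait limit.
    return "complete" if cores <= 4 else "unsatisfiable"


outcome = run_with_fallback([4, 8, 16], fake_submit, max_wait_seconds=600)
print(outcome)  # prints (4, 'complete')
```

Because each attempt is submitted with a concrete core count, runtime parameters such as --threads would still match the node that actually runs the step.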
