Feature #16316
a-c-r handles resource range requests (especially CPU) and adjusts requests based on what is in the InstanceTypes list
Description
Implement a version of select_resources for Arvados.
You can get a dictionary of instance types with this:
api.config()["InstanceTypes"]
The select_resources method should, at minimum, accept a range of CPU core values (e.g. coresMin: 4, coresMax: 16) and then check the available InstanceTypes and assign the greatest core count available. For example, if the system is only configured with 2, 4, and 8 core nodes, it should assign 8 cores since it is in the range (4 - 16).
RAM and disk can also have a range. Just return the minimum value for now (this is the existing behavior).
Tell cwltool to use your select_resources method by setting the object field runtimeContext.select_resources.
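A minimal sketch of the selection rule described above, not the a-c-r implementation. It assumes each InstanceTypes entry exposes its core count under a "VCPUs" key, and it takes the instance type dictionary as a third argument (hypothetical; it could be bound with functools.partial before being assigned to runtimeContext.select_resources). The returned keys mirror the defaults cwltool builds when no hook is set.

```python
import math


def select_resources(request, runtime_context, instance_types):
    """Pick the largest core count in [coresMin, coresMax] that some
    instance type offers; RAM and disk stay at their minimum values."""
    cores_min = request["coresMin"]
    cores_max = request["coresMax"]
    # "VCPUs" is an assumed field name for each InstanceTypes entry.
    in_range = [
        it["VCPUs"]
        for it in instance_types.values()
        if cores_min <= it["VCPUs"] <= cores_max
    ]
    # If no instance type falls inside the range, fall back to the minimum.
    cores = max(in_range) if in_range else cores_min
    return {
        "cores": cores,
        "ram": math.ceil(request["ramMin"]),
        "tmpdirSize": math.ceil(request["tmpdirMin"]),
        "outdirSize": math.ceil(request["outdirMin"]),
    }


# Hypothetical cluster configured with 2-, 4-, and 8-core nodes.
example_types = {
    "small": {"VCPUs": 2},
    "medium": {"VCPUs": 4},
    "large": {"VCPUs": 8},
}
example_request = {
    "coresMin": 4,
    "coresMax": 16,
    "ramMin": 1024,
    "tmpdirMin": 1024,
    "outdirMin": 1024,
}
result = select_resources(example_request, None, example_types)
print(result["cores"])  # 8, the largest available count within 4-16
```

Falling back to coresMin when nothing matches is a design choice made for this sketch; a real implementation might prefer to raise an error or pick the nearest larger node.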
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-05-06 Sprint to 2020-05-20 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-05-20 Sprint to 2020-06-03 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-06-03 Sprint to 2020-06-17 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-06-17 Sprint to 2020-07-01 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-07-01 Sprint to 2020-07-15
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-07-15 to 2020-08-12 Sprint
Updated by Peter Amstutz over 4 years ago
- Related to Idea #16011: CWL support, docs, training, website added
Updated by Peter Amstutz about 4 years ago
- Target version changed from 2020-08-12 Sprint to 2020-08-26 Sprint
Updated by Peter Amstutz about 4 years ago
- Target version changed from 2020-08-26 Sprint to 2020-09-09 Sprint
Updated by Peter Amstutz about 4 years ago
- Target version changed from 2020-09-09 Sprint to 2020-09-23 Sprint
Updated by Peter Amstutz about 4 years ago
- Target version changed from 2020-09-23 Sprint to 2020-10-07 Sprint
Updated by Peter Amstutz about 4 years ago
- Target version changed from 2020-10-07 Sprint to 2020-10-21 Sprint
Updated by Peter Amstutz almost 4 years ago
- Target version changed from 2020-10-21 Sprint to 2020-11-04 Sprint
Updated by Peter Amstutz almost 4 years ago
- Target version changed from 2020-11-04 Sprint to 2020-11-18
Updated by Peter Amstutz almost 4 years ago
- Target version deleted (2020-11-18)
Updated by Peter Amstutz over 3 years ago
- Target version set to 2021-03-31 sprint
Updated by Peter Amstutz over 3 years ago
- Assigned To changed from Peter Amstutz to Jiayong Li
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-03-31 sprint to 2021-04-14 sprint
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-04-14 sprint to 2021-04-28 bughunt sprint
Updated by Peter Amstutz over 3 years ago
- Target version deleted (2021-04-28 bughunt sprint)
Updated by Peter Amstutz over 3 years ago
- Related to Idea #17848: CWL runner improvements added
Updated by Peter Amstutz over 3 years ago
- Related to deleted (Idea #16011: CWL support, docs, training, website)
Updated by Peter Amstutz almost 2 years ago
- Target version set to 2022-11-09 sprint
Updated by Jiayong Li almost 2 years ago
- Status changed from New to In Progress
- Description updated (diff)
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Updated by Jiayong Li almost 2 years ago
I'm trying to run the unit test "test_resource_requirements" using python3 virtualenv on tordo shell node. I got the following error.
$ python setup.py install
Installed /home/jli/env/acr/lib/python3.7/site-packages/arvados_cwl_runner-2.5.0.dev20221129154757-py3.7.egg
Processing dependencies for arvados-cwl-runner==2.5.0.dev20221129154757
error: pyparsing 3.0.9 is installed but pyparsing<3 is required by {'arvados-python-client'}
$ python setup.py test --test-suite=test.test_container.test_resource_requirements
Using /home/jli/git/arvados/sdk/cwl/.eggs/googleapis_common_protos-1.57.0-py3.7.egg
Traceback (most recent call last):
  File "setup.py", line 60, in <module>
    zip_safe=True,
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 216, in run
    installed_dists = self.install_dists(self.distribution)
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 207, in install_dists
    ir_d = dist.fetch_build_eggs(dist.install_requires)
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/dist.py", line 724, in fetch_build_eggs
    replace_conflicting=True,
  File "/home/jli/env/acr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (google-auth 1.35.0 (/home/jli/git/arvados/sdk/cwl/.eggs/google_auth-1.35.0-py3.7.egg), Requirement.parse('google-auth<3.0dev,>=2.14.1'), {'google-api-core'})
1. Are these the right commands for unit testing?
2. Is pyparsing 3.0.9 the main issue here?
Updated by Peter Amstutz almost 2 years ago
Try merging main; I think that fixed this.
Updated by Jiayong Li almost 2 years ago
After trying
$ pip install -e .
I get
$ python setup.py test --test-suite=test.test_container.test_resource_requirements
Installed /home/jli/git/arvados/sdk/cwl/.eggs/mock-3.0.5-py3.7.egg
running egg_info
writing arvados_cwl_runner.egg-info/PKG-INFO
writing dependency_links to arvados_cwl_runner.egg-info/dependency_links.txt
writing entry points to arvados_cwl_runner.egg-info/entry_points.txt
writing requirements to arvados_cwl_runner.egg-info/requires.txt
writing top-level names to arvados_cwl_runner.egg-info/top_level.txt
reading manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
running build_ext
Traceback (most recent call last):
  File "setup.py", line 60, in <module>
    zip_safe=True,
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 227, in run
    with self.project_on_sys_path():
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/jli/env/acr/lib/python3.7/site-packages/setuptools/command/test.py", line 166, in project_on_sys_path
    require('%s==%s' % (ei_cmd.egg_name, ei_cmd.egg_version))
  File "/home/jli/env/acr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/jli/env/acr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (google-auth 2.15.0 (/home/jli/git/arvados/sdk/cwl/.eggs/google_auth-2.15.0-py3.7.egg), Requirement.parse('google-auth<2'), {'arvados-python-client'})
Updated by Jiayong Li almost 2 years ago
Ran "pip install -e ." under both arvados/sdk/cwl and arvados/sdk/python.
Now the test command runs, but the test fails for some reason.
$ python setup.py test --test-suite=test.test_container.test_resource_requirements
test_container (unittest.loader._FailedTest) ... ERROR
======================================================================
ERROR: test_container (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_container
Traceback (most recent call last):
  File "/usr/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
ModuleNotFoundError: No module named 'test.test_container'
----------------------------------------------------------------------
Ran 1 test in 0.000s
FAILED (errors=1)
Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
Updated by Jiayong Li almost 2 years ago
This command works for the unit test.
$ python setup.py test --test-suite=tests.test_container.TestContainer.test_resource_requirements
Using /home/jli/git/arvados/sdk/python for version number calculation of /home/jli/git/arvados/sdk/cwl
running test
Searching for subprocess32>=3.5.1
Best match: subprocess32 3.5.4
Processing subprocess32-3.5.4-py3.7.egg
Using /home/jli/git/arvados/sdk/cwl/.eggs/subprocess32-3.5.4-py3.7.egg
Searching for mock<4,>=1.0
Best match: mock 3.0.5
Processing mock-3.0.5-py3.7.egg
Using /home/jli/git/arvados/sdk/cwl/.eggs/mock-3.0.5-py3.7.egg
running egg_info
writing arvados_cwl_runner.egg-info/PKG-INFO
writing dependency_links to arvados_cwl_runner.egg-info/dependency_links.txt
writing entry points to arvados_cwl_runner.egg-info/entry_points.txt
writing requirements to arvados_cwl_runner.egg-info/requires.txt
writing top-level names to arvados_cwl_runner.egg-info/top_level.txt
reading manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'arvados_cwl_runner.egg-info/SOURCES.txt'
running build_ext
test_resource_requirements (tests.test_container.TestContainer) ... ok
----------------------------------------------------------------------
Ran 1 test in 2.149s
OK
Updated by Jiayong Li almost 2 years ago
- Target version changed from 2022-12-07 Sprint to 2022-12-21 Sprint
In the cwltool code, I'm trying to understand select_resources in action (method of class MultithreadedJobExecutor in executors.py), and I've found a mention of select_resources in process.py.
if runtimeContext.select_resources is not None:
    # Call select resources hook
    return runtimeContext.select_resources(request_evaluated, runtimeContext)
What I find confusing is the following:
1. runtimeContext is an object of class RuntimeContext (from context.py), and select_resources is a method defined on the MultithreadedJobExecutor class (from executors.py). I don't see how we can call select_resources on runtimeContext.
2. runtimeContext appears both as the object and as the argument, which is confusing to me. What is this statement doing at a higher level?
Updated by Peter Amstutz almost 2 years ago
Jiayong Li wrote in #note-34:
In the cwltool code, I'm trying to understand select_resources in action (method of class MultithreadedJobExecutor in executors.py), and I've found a mention of select_resources in process.py.
[...] What I find confusing is the following:
1. runtimeContext is an object of class RuntimeContext (from context.py), and select_resources is a method defined on the MultithreadedJobExecutor class (from executors.py). I don't see how we can call select_resources on runtimeContext.
2. runtimeContext appears both as the object and as the argument, which is confusing to me. What is this statement doing at a higher level?
- select_resources is a field on RuntimeContext of type "Callable". That means it is a variable that holds something which is callable as a function. It isn't a method.
- MultithreadedJobExecutor.select_resources is a method that can be assigned to RuntimeContext.select_resources.
- In Python, you can assign "callable = object.method", and then invoking "callable()" will have the same behavior as calling "object.method()".
- For this task, we want to provide our own select_resources function or method. The way this is made available at the right place in cwltool is by assigning a custom callable to the select_resources field on RuntimeContext.
- The code references select_resources as a variable whose value is a function, then calls that function. It passes the runtimeContext object to the function because the function would not necessarily have it otherwise.
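The points above can be illustrated with a small standalone example. The Executor and Context classes here are hypothetical stand-ins, not the cwltool classes; they only show that a bound method stored in a plain field behaves the same as calling the method directly.

```python
class Executor:
    """Hypothetical stand-in for a class that defines select_resources."""

    def select_resources(self, request, runtime_context):
        return {"cores": request["coresMin"]}


class Context:
    """Hypothetical stand-in for RuntimeContext: select_resources is a field."""

    def __init__(self):
        self.select_resources = None  # no hook installed by default


ctx = Context()
executor = Executor()

# Store the bound method in the field; "self" stays attached to executor.
ctx.select_resources = executor.select_resources

request = {"coresMin": 4}
# The context is passed explicitly because the callable is not a method
# of Context and has no other way to reach it.
via_field = ctx.select_resources(request, ctx)
via_method = executor.select_resources(request, ctx)
print(via_field == via_method)  # True
```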
Updated by Jiayong Li almost 2 years ago
The assignment happened in main.py
temp_executor = MultithreadedJobExecutor()
runtimeContext.select_resources = temp_executor.select_resources
Updated by Jiayong Li almost 2 years ago
1. There is no explicit mention of select_resources from cwltool.executors in the a-c-r code. How is it used implicitly?
2. How do I use select_resources from cwltool.executors in arvados_cwl.executor without copy/pasting?
Answer:
Currently we don't use the select_resources hook (runtimeContext.select_resources is None), so defaultReq is used as the requirement.
process.py
evalResources
if runtimeContext.select_resources is not None:
    # Call select resources hook
    return runtimeContext.select_resources(request_evaluated, runtimeContext)
defaultReq = {
    "cores": request_evaluated["coresMin"],
    "ram": math.ceil(request_evaluated["ramMin"]),
    "tmpdirSize": math.ceil(request_evaluated["tmpdirMin"]),
    "outdirSize": math.ceil(request_evaluated["outdirMin"]),
}
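To make the fallback concrete, here is a simplified standalone model of the logic quoted above. It is not the cwltool function itself: eval_resources, use_max_cores, and the SimpleNamespace contexts are illustrative assumptions.

```python
import math
from types import SimpleNamespace


def eval_resources(request_evaluated, runtime_context):
    """Simplified model: use the hook if one is set, else the minimums."""
    if runtime_context.select_resources is not None:
        return runtime_context.select_resources(request_evaluated, runtime_context)
    return {
        "cores": request_evaluated["coresMin"],
        "ram": math.ceil(request_evaluated["ramMin"]),
        "tmpdirSize": math.ceil(request_evaluated["tmpdirMin"]),
        "outdirSize": math.ceil(request_evaluated["outdirMin"]),
    }


def use_max_cores(request, runtime_context):
    """Hypothetical hook that always asks for the top of the core range."""
    return {"cores": request["coresMax"]}


request = {
    "coresMin": 4,
    "coresMax": 16,
    "ramMin": 255.5,
    "tmpdirMin": 1024,
    "outdirMin": 1024,
}

# Without a hook, the minimum values are used (ram is rounded up).
no_hook = eval_resources(request, SimpleNamespace(select_resources=None))
# With a hook, whatever the hook returns is used as-is.
with_hook = eval_resources(request, SimpleNamespace(select_resources=use_max_cores))
print(no_hook["cores"], with_hook["cores"])  # 4 16
```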
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2022-12-21 Sprint to 2023-01-18 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from 2023-02-01 sprint to To be scheduled
- Assigned To deleted (Jiayong Li)
Updated by Peter Amstutz over 1 year ago
- Related to Idea #20273: More CWL runner improvements added
Updated by Peter Amstutz over 1 year ago
- Related to deleted (Idea #17848: CWL runner improvements)
Updated by Peter Amstutz over 1 year ago
- Target version changed from To be scheduled to Development 2023-07-05 sprint
Updated by Peter Amstutz over 1 year ago
- Related to Idea #18179: Better spot instance support added
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-08-02 sprint to Future
Updated by Brett Smith 10 months ago
Now that Crunch is gaining the ability to consider a range of instance types for a container, how are these features going to interact? e.g., in the example in the description, the workflow step says it can use anywhere between 4-16 cores. If select_resources boils this down to a request for 8 cores (the largest available in the range), does that mean Crunch won't consider running the container on 4-core nodes, even though the workflow step would support it?
Updated by Brett Smith 10 months ago
- Related to Feature #20978: Support multiple candidate instance types to assign containers added
Updated by Peter Amstutz 10 months ago
Brett Smith wrote in #note-51:
Now that Crunch is gaining the ability to consider a range of instance types for a container, how are these features going to interact? e.g., in the example in the description, the workflow step says it can use anywhere between 4-16 cores. If select_resources boils this down to a request for 8 cores (the largest available in the range), does that mean Crunch won't consider running the container on 4-core nodes, even though the workflow step would support it?
Further complicating things, CWL lets you include runtime node selection parameters in the command line or environment, e.g. setting a --threads parameter equal to the number of CPU cores allocated. So pushing node selection over to the dispatcher (which is in the position to choose a lower spec node when a higher spec node isn't available) isn't ideal either.
This feature could use some more design work.
One thought that occurs to me would be to ask for maximum specs, set a maximum time you're willing to wait for an instance type once the container has reached the top of the priority list, and have a new container failure state, "unsatisfiable", that would tell a-c-r it could retry at lower specs.