Idea #13925
Default keep cache scales with requested container size
Description
The default keep cache size is 256 MiB. For certain workloads this is much too small; in particular, multithreaded workloads that read from multiple files experience severe cache contention. Unfortunately, it is difficult for users to diagnose performance problems caused by the keep cache. Often, the response is simply to request more resources via runtime_constraints, but because the keep cache does not scale with container/machine size, this has no effect.
Based on the observation that (a) users request more VCPUs for multithreaded workloads and (b) users' typical response to performance problems is to request more resources, we should scale the default keep cache based on runtime_constraints.
The cache should be sized either as a percentage of requested RAM (say 12.5%) or in proportion to the number of requested cores (say 384 MiB per core).
This could be computed by a-c-r or on the API server.
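A minimal sketch of the proposed sizing rule, as it might look in a-c-r. The function name, the choice to take the maximum of the two heuristics, and the 256 MiB floor (the current fixed default) are illustrative assumptions, not an Arvados API:

```python
# Illustrative sketch: scale the default keep cache with the container's
# requested resources. Names and the max() combination are assumptions.
MIN_CACHE = 256 * 1024**2        # current fixed default: 256 MiB
CACHE_PER_CORE = 384 * 1024**2   # proposed: 384 MiB per requested VCPU
RAM_FRACTION = 0.125             # alternative: 12.5% of requested RAM

def default_keep_cache(vcpus, ram_bytes):
    """Return a default keep cache size (bytes) scaled to runtime_constraints."""
    per_core = vcpus * CACHE_PER_CORE
    per_ram = int(ram_bytes * RAM_FRACTION)
    # Never shrink below the existing 256 MiB default.
    return max(MIN_CACHE, per_core, per_ram)
```

For example, a container requesting 4 VCPUs and 8 GiB of RAM would get max(256 MiB, 4 × 384 MiB, 1 GiB) = 1536 MiB of cache, so the user's intuitive response of requesting more resources actually enlarges the cache.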
Updated by Peter Amstutz about 6 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 6 years ago
- Description updated (diff)
- Status changed from In Progress to New
Updated by Joshua Randall about 6 years ago
When you say "machine size" do you actually mean the vcpus allocated to the job? On our system we use slurm with consumable resources and cgroup limits, so multiple jobs are run on each machine.
Updated by Peter Amstutz about 6 years ago
Joshua Randall wrote:
When you say "machine size" do you actually mean the vcpus allocated to the job? On our system we use slurm with consumable resources and cgroup limits, so multiple jobs are run on each machine.
Good point. Yes, I was thinking it would be calculated on vcpus / RAM allocated to the job, since we don't have an actual machine size at the point of making the container request. But really I wrote this on the assumption of cloud nodes, where we try to allocate the best fit VM for the job and only run one job at a time per VM.
The intention is that the intuition of giving more resources to a slow tool would have some effect. However, if that means the user ends up asking for more cores/RAM than will actually be used, that is quite wasteful.
Another way to go about this might be to use the warning mechanism under development in #13773 to report cache thrashing, and have a streamlined way of retrying with a bigger cache.
Updated by Peter Amstutz about 3 years ago
- Target version deleted (To Be Groomed)
Updated by Tom Clegg almost 3 years ago
- Description updated (diff)
- Subject changed from Default keep cache scales with machine size to Default keep cache scales with requested container size