Idea #13925 (open)

Default keep cache scales with requested container size

Added by Peter Amstutz over 5 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: Future
Start date: -
Due date: -
Story points: -
Release: 60
Release relationship: Auto

Description

The default keep cache size is 256 MiB. For certain workloads this is much too small; in particular, multithreaded workloads that read from multiple files experience severe cache contention. Unfortunately, it is difficult for users to trace performance problems back to the keep cache, so the typical response is simply to request more resources via runtime_constraints. However, because the keep cache does not scale with container/machine size, this has no effect.

Based on the observations that (a) users request more VCPUs for multithreaded workloads and (b) users' typical response to performance problems is to request more resources, we should scale the default keep cache with runtime_constraints.

The default cache size should be either a percentage of the requested RAM (say 12.5%) or proportional to the number of cores (say 384 MiB per core).

This could be computed by arvados-cwl-runner (a-c-r) or on the API server.
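
A rough sketch of what such a rule could look like (Python, in the spirit of a-c-r; the function name, the take-the-maximum policy, and the exact constants are illustrative assumptions, not a settled design):

    # Illustrative only: derive a default keep cache size from a
    # container request's runtime_constraints. "ram" is in bytes and
    # "vcpus" is a count, as in Arvados runtime_constraints.
    DEFAULT_KEEP_CACHE = 256 << 20  # current fixed default, 256 MiB
    PER_CORE_CACHE = 384 << 20      # proposed 384 MiB per VCPU
    RAM_FRACTION = 0.125            # proposed 12.5% of requested RAM

    def default_keep_cache(runtime_constraints):
        vcpus = runtime_constraints.get("vcpus", 1)
        ram = runtime_constraints.get("ram", 0)
        # Take whichever heuristic gives more, but never less than
        # today's 256 MiB default.
        return max(DEFAULT_KEEP_CACHE,
                   vcpus * PER_CORE_CACHE,
                   int(ram * RAM_FRACTION))

For example, a container requesting 8 VCPUs would get 8 x 384 MiB = 3 GiB of keep cache instead of the flat 256 MiB.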

#1

Updated by Peter Amstutz over 5 years ago

  • Status changed from New to In Progress
#2

Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
  • Status changed from In Progress to New
#3

Updated by Joshua Randall over 5 years ago

When you say "machine size" do you actually mean the vcpus allocated to the job? On our system we use slurm with consumable resources and cgroup limits, so multiple jobs are run on each machine.

#4

Updated by Peter Amstutz over 5 years ago

Joshua Randall wrote:

When you say "machine size" do you actually mean the vcpus allocated to the job? On our system we use slurm with consumable resources and cgroup limits, so multiple jobs are run on each machine.

Good point. Yes, I was thinking it would be calculated from the vcpus / RAM allocated to the job, since we don't have an actual machine size at the point of making the container request. But really I wrote this on the assumption of cloud nodes, where we try to allocate the best-fit VM for the job and run only one job at a time per VM.

The intention is to make it so that the intuition of giving more resources to a slow tool would have some effect. However, if that means the user ends up asking for more cores / RAM than are actually going to be used, that is quite wasteful.

Another way to go about this might be to use the warning mechanism under development in #13773 to report cache thrashing, and have a streamlined way of retrying with a bigger cache.
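
For instance, a retry helper along these lines could resubmit the request with a larger cache (a sketch assuming the Python SDK's container_requests API; the helper name and the doubling policy are made up for illustration):

    import arvados

    def retry_with_bigger_cache(cr_uuid):
        # Sketch: fetch the original container request and resubmit
        # it with keep_cache_ram doubled.
        api = arvados.api("v1")
        cr = api.container_requests().get(uuid=cr_uuid).execute()
        rc = cr["runtime_constraints"]
        rc["keep_cache_ram"] = rc.get("keep_cache_ram", 256 << 20) * 2
        return api.container_requests().create(body={
            "command": cr["command"],
            "container_image": cr["container_image"],
            "output_path": cr["output_path"],
            "runtime_constraints": rc,
            "priority": cr["priority"],
        }).execute()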

#5

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (To Be Groomed)
#6

Updated by Tom Clegg over 2 years ago

  • Description updated (diff)
  • Subject changed from Default keep cache scales with machine size to Default keep cache scales with requested container size
#7

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
#8

Updated by Peter Amstutz about 2 months ago

  • Target version set to Future
