Bug #19981
openContainers that used an old DefaultKeepCacheRAM no longer get reused after a configuration change
Description
The Bug
beagle.cwl has the resource requirement
ResourceRequirement: coresMin: 2 ramMin: 10000
A new run:
https://workbench.2xpu4.arvadosapi.com/container_requests/2xpu4-xvhdp-ph1xry8mxbsol3j
An old run:
https://workbench.2xpu4.arvadosapi.com/container_requests/2xpu4-xvhdp-p571e0xq4g85ac7
The resource requirement didn't change, neither was keep_cache requirement specified. The recent run didn't reuse old run, since there is the following difference.
new runtime_constraints:
keep_cache_disk 10485760000 keep_cache_ram 0 ram 10485760000 vcpus 2
new node type:
"ProviderType": "m5.8xlarge", "VCPUs": 32, "RAM": 137438953472, "IncludedScratch": 4000000000, "AddedScratch": 100000000000, "Price": 1.542,
old runtime_constraints:
keep_cache_disk 0 keep_cache_ram 268435456 ram 10485760000 vcpus 2
old node type:
"ProviderType": "m5.xlarge", "VCPUs": 4, "RAM": 17179869184, "IncludedScratch": 4000000000, "AddedScratch": 0, "Price": 0.192,
The Fix
This happened because we changed the DefaultKeepCacheRAM
setting on the cluster, to start using disk cache instead of memory. As a consequence, Container.find_reusable
can no longer find containers that used the old default, because it searches for matching runtime_constraints
with a hash match, and it doesn't know what the old value of DefaultKeepCacheRAM
was to search for.
Ideally we would like to exclude the Keep cache constraints from reuse entirely but in order to do that we need some change to the way we store runtime_constraints
in the database—right now it's just plain text. Ideas that have been suggested:
- Convert the column to
jsonb
and do richer queries on it (Brett in note-14) - Add a column
reusable_runtime_constraints
that's limited to recording the constraints that affect reuse (Tom in note-15)
Agree on one and implement it.
Updated by Brett Smith almost 2 years ago
- Related to Feature #18842: Local disk keep cache for Python SDK/arv-mount added
Updated by Brett Smith almost 2 years ago
- Status changed from In Progress to New
- Category set to API
- Subject changed from arvados-cwl-runner schedules keep cache disk when the cwl does not to Default keep_cache_disk runtime parameter prevents reuse of containers that predate the constraint
This is a maybe-unintended consequence of changes made on purpose in #18842. As of Arvados 2.5.0 you can now choose whether your containers use a Keep cache backed by disk or RAM. If you do not specify either, the API server will generate a keep_cache_disk
constraint based on the RAM available on the instance type (745a7e253326a2b8e292dbd928c76320ce320216).
It looks like there was an attempt to exclude keep_cache_disk
from reuse considerations in 2b6f6f61757ad01f7d8baf63b190c1e1cd42db41, but I'm not sure it works as intended when you're comparing a container that was created before this default constraint was introduced to one that was created after.
In the meantime, if you want to reuse jobs, you can work around this by adding a keep_cache_ram
runtime constraint to your workflow step that matches the old one (268435456). Sorry for the trouble.
Updated by Brett Smith almost 2 years ago
- Subject changed from Default keep_cache_disk runtime parameter prevents reuse of containers that predate the constraint to Default keep_cache_disk runtime constraint prevents reuse of containers that predate the constraint
Updated by Brett Smith almost 2 years ago
- Release deleted (
60) - Target version set to Future
Like I said elsewhere, this is a bug where we either fix it soon, or we don't bother fixing it.
Updated by Brett Smith almost 2 years ago
If we decide not to fix it we should at least add a 2.5.0 release note to highlight the impact on container reuse.
Updated by Peter Amstutz almost 2 years ago
- Story points set to 1.0
- Assigned To set to Brett Smith
- Description updated (diff)
Updated by Brett Smith almost 2 years ago
- Target version changed from Future to To be scheduled
- Description updated (diff)
Updated by Brett Smith almost 2 years ago
- Target version changed from To be scheduled to Development 2023-03-15 sprint
- Status changed from New to In Progress
Updated by Brett Smith almost 2 years ago
19981-reuse-flex-keep-cache @ 599e2e6798a2497ed56b6dfe917a09260dd74407 - developer-run-tests: #3530
Since I was just thinking about container reuse in #16325 and this was bugging me, I figured I could knock it out real quick. lol famous last words.
Runtime constraints are stored as text
in the database, so the way this all works now is Container.find_reusable
normalizes the hash, MD5's it, then looks for matches that way. The hashing precludes being selective about constraints.
Even if we undid the hashing, which would be a big change by itself, the only option we have to ignore specific constraints is by loose text matching, which I basically never want to do.
IMO we should convert this column to jsonb
and then do rich matching on it, but that's obviously a bigger story that needs more discussion and agreement up front.
So, I have a branch up that generates variations of the runtime constraints by filling in common values for the Keep cache constraints, and considers them all eligible for reuse.
This does expand reuse in the direction that we want. It's also something that we're basically doing already: it's the strategy that we're using to look for containers that predate cuda
and keep_cache_disk
, just expanded to more values conceptually.
The main con of this approach is it doesn't even resolve the original bug. The real issue here isn't that the API server can't find containers that predate keep_cache_disk
; it's that the configuration of DefaultKeepCacheRAM
changed. Old containers used keep_cache_ram>0
, and after the change new requests have keep_cache_ram=0
, and that precludes a match. And it still does, even after this branch.
The cheapest way to address that would be to give administrators a configuration knob for "previous keep_cache_ram
defaults to search for." With this branch, it would be easy to include those values in the search. But that also requires more discussion, and maybe it would be better to just rip the band-aid off and make a bigger change like the jsonb
switch.
On balance, I think, I hope, that this branch is still a net win. But it's admittedly a lot more marginal than I hoped it would be when I started working on it.
Updated by Tom Clegg almost 2 years ago
This branch LGTM, thanks.
For a longer term solution, another possibility -- less flexible but more efficient than rich matching on json -- is an explicit reusable_runtime_constraints
(or just ..._md5
) column, with only the keys we really want to match on. When we change the rules, this column could be rewritten in a migration.
(We could almost just rewrite runtime_constraints
for old containers, but I think that would be too obnoxious since it would no longer serve as the record of what was actually done.)
Updated by Brett Smith almost 2 years ago
- Story points deleted (
1.0) - Target version deleted (
Development 2023-03-15 sprint) - Assigned To deleted (
Brett Smith) - Status changed from In Progress to New
- Description updated (diff)
- Subject changed from Default keep_cache_disk runtime constraint prevents reuse of containers that predate the constraint to Containers that used an old DefaultKeepCacheRAM no longer get reused after a configuration change
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-15:
For a longer term solution, another possibility -- less flexible but more efficient than rich matching on json -- is an explicit
reusable_runtime_constraints
(or just..._md5
) column, with only the keys we really want to match on. When we change the rules, this column could be rewritten in a migration.
This makes sense to me too and I would be fine with it.
Since my branch doesn't actually fix the bug, I have updated the subject following what I learned, and the proposed fix to capture these ideas.
(We could almost just rewrite
runtime_constraints
for old containers, but I think that would be too obnoxious since it would no longer serve as the record of what was actually done.)
I agree this would be user-hostile. People want to go back and look up "what was the X setting I used on the container that worked?" so we don't want to lose that information.
Updated by Brett Smith almost 2 years ago
- Related to Bug #19884: keep_cache_disk runtime constraint is undocumented added
Updated by Peter Amstutz almost 2 years ago
- Release deleted (
57) - Target version set to Future
I'm moving this off of 2.6.0 because I don't want to block on it, but we should decide if we're going to do anything else.