Project

General

Profile

Actions

Bug #19981

open

Containers that used an old DefaultKeepCacheRAM no longer get reused after a configuration change

Added by Jiayong Li about 1 year ago. Updated 11 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
API
Target version:
Start date:
03/05/2023
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

The Bug

beagle.cwl has the resource requirement

  ResourceRequirement:
    coresMin: 2
    ramMin: 10000

A new run:
https://workbench.2xpu4.arvadosapi.com/container_requests/2xpu4-xvhdp-ph1xry8mxbsol3j

An old run:
https://workbench.2xpu4.arvadosapi.com/container_requests/2xpu4-xvhdp-p571e0xq4g85ac7

The resource requirement didn't change, neither was keep_cache requirement specified. The recent run didn't reuse old run, since there is the following difference.

new runtime_constraints:

keep_cache_disk    10485760000
keep_cache_ram    0
ram    10485760000
vcpus    2

new node type:

"ProviderType": "m5.8xlarge",
"VCPUs": 32,
"RAM": 137438953472,
"IncludedScratch": 4000000000,
"AddedScratch": 100000000000,
"Price": 1.542,

old runtime_constraints:

keep_cache_disk    0
keep_cache_ram    268435456
ram    10485760000
vcpus    2

old node type:

"ProviderType": "m5.xlarge",
"VCPUs": 4,
"RAM": 17179869184,
"IncludedScratch": 4000000000,
"AddedScratch": 0,
"Price": 0.192,

The Fix

This happened because we changed the DefaultKeepCacheRAM setting on the cluster, to start using disk cache instead of memory. As a consequence, Container.find_reusable can no longer find containers that used the old default, because it searches for matching runtime_constraints with a hash match, and it doesn't know what the old value of DefaultKeepCacheRAM was to search for.

Ideally we would like to exclude the Keep cache constraints from reuse entirely but in order to do that we need some change to the way we store runtime_constraints in the database—right now it's just plain text. Ideas that have been suggested:

  • Convert the column to jsonb and do richer queries on it (Brett in note-14)
  • Add a column reusable_runtime_constraints that's limited to recording the constraints that affect reuse (Tom in note-15)

Agree on one and implement it.


Subtasks 1 (0 open1 closed)

Task #20202: Review 19981-reuse-flex-keep-cacheResolvedBrett Smith03/05/2023

Actions

Related issues

Related to Arvados - Feature #18842: Local disk keep cache for Python SDK/arv-mountResolvedPeter Amstutz10/21/2022

Actions
Related to Arvados - Bug #19884: keep_cache_disk runtime constraint is undocumentedResolvedBrett Smith

Actions
Actions #1

Updated by Jiayong Li about 1 year ago

  • Status changed from New to In Progress
Actions #2

Updated by Jiayong Li about 1 year ago

  • Description updated (diff)
Actions #3

Updated by Jiayong Li about 1 year ago

  • Description updated (diff)
Actions #5

Updated by Brett Smith about 1 year ago

  • Related to Feature #18842: Local disk keep cache for Python SDK/arv-mount added
Actions #6

Updated by Brett Smith about 1 year ago

  • Status changed from In Progress to New
  • Category set to API
  • Subject changed from arvados-cwl-runner schedules keep cache disk when the cwl does not to Default keep_cache_disk runtime parameter prevents reuse of containers that predate the constraint

This is a maybe-unintended consequence of changes made on purpose in #18842. As of Arvados 2.5.0 you can now choose whether your containers use a Keep cache backed by disk or RAM. If you do not specify either, the API server will generate a keep_cache_disk constraint based on the RAM available on the instance type (745a7e253326a2b8e292dbd928c76320ce320216).

It looks like there was an attempt to exclude keep_cache_disk from reuse considerations in 2b6f6f61757ad01f7d8baf63b190c1e1cd42db41, but I'm not sure it works as intended when you're comparing a container that was created before this default constraint was introduced to one that was created after.

In the meantime, if you want to reuse jobs, you can work around this by adding a keep_cache_ram runtime constraint to your workflow step that matches the old one (268435456). Sorry for the trouble.

Actions #7

Updated by Brett Smith about 1 year ago

  • Subject changed from Default keep_cache_disk runtime parameter prevents reuse of containers that predate the constraint to Default keep_cache_disk runtime constraint prevents reuse of containers that predate the constraint
Actions #8

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
Actions #9

Updated by Brett Smith about 1 year ago

  • Release deleted (60)
  • Target version set to To be groomed

Like I said elsewhere, this is a bug where we either fix it soon, or we don't bother fixing it.

Actions #10

Updated by Brett Smith about 1 year ago

If we decide not to fix it we should at least add a 2.5.0 release note to highlight the impact on container reuse.

Actions #11

Updated by Peter Amstutz about 1 year ago

  • Story points set to 1.0
  • Assigned To set to Brett Smith
  • Description updated (diff)
Actions #12

Updated by Brett Smith about 1 year ago

  • Target version changed from To be groomed to To be scheduled
  • Description updated (diff)
Actions #13

Updated by Brett Smith 12 months ago

  • Target version changed from To be scheduled to Development 2023-03-15 sprint
  • Status changed from New to In Progress
Actions #14

Updated by Brett Smith 12 months ago

19981-reuse-flex-keep-cache @ 599e2e6798a2497ed56b6dfe917a09260dd74407 - developer-run-tests: #3530

Since I was just thinking about container reuse in #16325 and this was bugging me, I figured I could knock it out real quick. lol famous last words.

Runtime constraints are stored as text in the database, so the way this all works now is Container.find_reusable normalizes the hash, MD5's it, then looks for matches that way. The hashing precludes being selective about constraints.

Even if we undid the hashing, which would be a big change by itself, the only option we have to ignore specific constraints is by loose text matching, which I basically never want to do.

IMO we should convert this column to jsonb and then do rich matching on it, but that's obviously a bigger story that needs more discussion and agreement up front.

So, I have a branch up that generates variations of the runtime constraints by filling in common values for the Keep cache constraints, and considers them all eligible for reuse.

This does expand reuse in the direction that we want. It's also something that we're basically doing already: it's the strategy that we're using to look for containers that predate cuda and keep_cache_disk, just expanded to more values conceptually.

The main con of this approach is it doesn't even resolve the original bug. The real issue here isn't that the API server can't find containers that predate keep_cache_disk; it's that the configuration of DefaultKeepCacheRAM changed. Old containers used keep_cache_ram>0, and after the change new requests have keep_cache_ram=0, and that precludes a match. And it still does, even after this branch.

The cheapest way to address that would be to give administrators a configuration knob for "previous keep_cache_ram defaults to search for." With this branch, it would be easy to include those values in the search. But that also requires more discussion, and maybe it would be better to just rip the band-aid off and make a bigger change like the jsonb switch.

On balance, I think, I hope, that this branch is still a net win. But it's admittedly a lot more marginal than I hoped it would be when I started working on it.

Actions #15

Updated by Tom Clegg 12 months ago

This branch LGTM, thanks.

For a longer term solution, another possibility -- less flexible but more efficient than rich matching on json -- is an explicit reusable_runtime_constraints (or just ..._md5) column, with only the keys we really want to match on. When we change the rules, this column could be rewritten in a migration.

(We could almost just rewrite runtime_constraints for old containers, but I think that would be too obnoxious since it would no longer serve as the record of what was actually done.)

Actions #16

Updated by Brett Smith 12 months ago

  • Story points deleted (1.0)
  • Target version deleted (Development 2023-03-15 sprint)
  • Assigned To deleted (Brett Smith)
  • Status changed from In Progress to New
  • Description updated (diff)
  • Subject changed from Default keep_cache_disk runtime constraint prevents reuse of containers that predate the constraint to Containers that used an old DefaultKeepCacheRAM no longer get reused after a configuration change
Actions #17

Updated by Brett Smith 12 months ago

Tom Clegg wrote in #note-15:

For a longer term solution, another possibility -- less flexible but more efficient than rich matching on json -- is an explicit reusable_runtime_constraints (or just ..._md5) column, with only the keys we really want to match on. When we change the rules, this column could be rewritten in a migration.

This makes sense to me too and I would be fine with it.

Since my branch doesn't actually fix the bug, I have updated the subject following what I learned, and the proposed fix to capture these ideas.

(We could almost just rewrite runtime_constraints for old containers, but I think that would be too obnoxious since it would no longer serve as the record of what was actually done.)

I agree this would be user-hostile. People want to go back and look up "what was the X setting I used on the container that worked?" so we don't want to lose that information.

Actions #18

Updated by Brett Smith 12 months ago

  • Related to Bug #19884: keep_cache_disk runtime constraint is undocumented added
Actions #19

Updated by Peter Amstutz 12 months ago

  • Release set to 57
Actions #20

Updated by Peter Amstutz 11 months ago

  • Release deleted (57)
  • Target version set to To be groomed

I'm moving this off of 2.6.0 because I don't want to block on it, but we should decide if we're going to do anything else.

Actions

Also available in: Atom PDF