Project

General

Profile

Actions

Bug #12568

closed

[CWL] Cap size of collection cache

Added by Peter Amstutz over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-

Description

The collection cache in arvados-cwl-runner is unbounded. When running workflows with many very large collections, this can result in gigabytes of RAM usage (requiring larger nodes for the workflow runner) and difficult to debug out of memory failures. It should manage the collection cache in order to cap RAM usage.


Subtasks 1 (0 open1 closed)

Task #12569: Review 12568-cwl-collection-cacheResolvedPeter Amstutz11/07/2017Actions
Actions #1

Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 6 years ago

  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz
  • Target version set to 2017-11-08 Sprint

12568-cwl-collection-cache @ d59645f3e566e691ba757f74bba503c13773dbe8

Actions #4

Updated by Peter Amstutz over 6 years ago

Added a couple unit tests. Now @ a70727762dafee667b022331307f6c0f949fd7e7

Actions #5

Updated by Peter Amstutz over 6 years ago

One more test update for completeness! a18005f8b35a68b4fcd9ccdf76832b28e564289c

Actions #6

Updated by Lucas Di Pentima over 6 years ago

The cache size management looks good to me.

If some enormous collections get into the cache and make a-c-r to be OOMed, shouldn't the cache parameters be modifiable via a-c-r flags?

Actions #7

Updated by Peter Amstutz over 6 years ago

Lucas Di Pentima wrote:

The cache size management looks good to me.

If some enormous collections get into the cache and make a-c-r to be OOMed, shouldn't the cache parameters be modifiable via a-c-r flags?

If the collection is really enormous (such as #12575) it will overflow the cache, and min_entries comes into play. We have to load the whole collection regardless. However to avoid caching too many huge collections I reduced min_entries from 4 to 2.

I believe default cache parameters are reasonably scaled to the 1 core / 3.5 GiB nodes that we ideally want to use for running workflows.

Adding a command line to adjust the cache size isn't quite as straightforward as it seems, because the command line parameter also needs to be propagated to the job or container request. It is probably a bigger change (with its own testing implications) than the underlying fix.

12568-cwl-collection-cache @ 8de691c25eac0454f8f30cfa35eccff15642e330

Actions #8

Updated by Lucas Di Pentima over 6 years ago

Ok, this fix seems to be enough, I was just thinking that maybe we can give the user more knobs to try to solve a potential similar problem, but I suppose she/he could also change the runtime requirements to ask for bigger nodes when detecting a OOM situation.

LGTM, please merge.

Actions #9

Updated by Peter Amstutz over 6 years ago

Lucas Di Pentima wrote:

Ok, this fix seems to be enough, I was just thinking that maybe we can give the user more knobs to try to solve a potential similar problem, but I suppose she/he could also change the runtime requirements to ask for bigger nodes when detecting a OOM situation.

Yes, that's right. I believe if the default cache settings are not sufficent, fiddling with the cache settings is unlikely to help, the user probably needs a bigger node anyway.

LGTM, please merge.

Thanks.

Actions #10

Updated by Anonymous over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:fc128bc6497aa266f925e2aa4821bde6fce9aade.

Actions

Also available in: Atom PDF