Bug #12568

[CWL] Cap size of collection cache

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
11/07/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

The collection cache in arvados-cwl-runner is unbounded. When running workflows with many very large collections, this can result in gigabytes of RAM usage (requiring larger nodes for the workflow runner) and difficult to debug out of memory failures. It should manage the collection cache in order to cap RAM usage.


Subtasks

Task #12569: Review 12568-cwl-collection-cacheResolvedPeter Amstutz

Associated revisions

Revision fc128bc6
Added by Peter Amstutz over 1 year ago

Merge branch '12568-cwl-collection-cache' closes #12568

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#2 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#3 Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz
  • Target version set to 2017-11-08 Sprint

12568-cwl-collection-cache @ d59645f3e566e691ba757f74bba503c13773dbe8

#4 Updated by Peter Amstutz over 1 year ago

Added a couple unit tests. Now @ a70727762dafee667b022331307f6c0f949fd7e7

#5 Updated by Peter Amstutz over 1 year ago

One more test update for completeness! a18005f8b35a68b4fcd9ccdf76832b28e564289c

#6 Updated by Lucas Di Pentima over 1 year ago

The cache size management looks good to me.

If some enormous collections get into the cache and make a-c-r to be OOMed, shouldn't the cache parameters be modifiable via a-c-r flags?

#7 Updated by Peter Amstutz over 1 year ago

Lucas Di Pentima wrote:

The cache size management looks good to me.

If some enormous collections get into the cache and make a-c-r to be OOMed, shouldn't the cache parameters be modifiable via a-c-r flags?

If the collection is really enormous (such as #12575) it will overflow the cache, and min_entries comes into play. We have to load the whole collection regardless. However to avoid caching too many huge collections I reduced min_entries from 4 to 2.

I believe default cache parameters are reasonably scaled to the 1 core / 3.5 GiB nodes that we ideally want to use for running workflows.

Adding a command line to adjust the cache size isn't quite as straightforward as it seems, because the command line parameter also needs to be propagated to the job or container request. It is probably a bigger change (with its own testing implications) than the underlying fix.

12568-cwl-collection-cache @ 8de691c25eac0454f8f30cfa35eccff15642e330

#8 Updated by Lucas Di Pentima over 1 year ago

Ok, this fix seems to be enough, I was just thinking that maybe we can give the user more knobs to try to solve a potential similar problem, but I suppose she/he could also change the runtime requirements to ask for bigger nodes when detecting a OOM situation.

LGTM, please merge.

#9 Updated by Peter Amstutz over 1 year ago

Lucas Di Pentima wrote:

Ok, this fix seems to be enough, I was just thinking that maybe we can give the user more knobs to try to solve a potential similar problem, but I suppose she/he could also change the runtime requirements to ask for bigger nodes when detecting a OOM situation.

Yes, that's right. I believe if the default cache settings are not sufficent, fiddling with the cache settings is unlikely to help, the user probably needs a bigger node anyway.

LGTM, please merge.

Thanks.

#10 Updated by Anonymous over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:fc128bc6497aa266f925e2aa4821bde6fce9aade.

Also available in: Atom PDF