Project

General

Profile

Actions

Bug #19563

closed

crunch-run upload step possibly buffers too much

Added by Peter Amstutz 4 months ago. Updated about 2 months ago.

Status:
Closed
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
10/25/2022
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

We have crunch-run processes that are getting OOM killed (and restarted) in the upload phase.

Crunch-run is uploading very large files (30+ GB) and running on very small nodes (t3.small) which have 1 core, 2 GB RAM, and throttled network bandwidth. The hoststat numbers show a much greater amount of data being received than transmitted.

The suspicion is that the crunch-run process is buffering data in RAM, which is piling up until it gets OOM killed.

a) determine if it is true that the queue of blocks to be uploaded is uncapped

b) if so, make it possible to set some cap which ensures there is backpressure that will block the uploader until there is more buffer space. Experimentally, I think we've found optimal upload rates with around 4-6 parallel block uploads.


Subtasks 1 (0 open1 closed)

Task #19577: Review 19563-log-cr-memResolvedPeter Amstutz10/25/2022

Actions
Actions #1

Updated by Peter Amstutz 4 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 4 months ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg 4 months ago

  • Status changed from New to In Progress
Collectionfs limits itself to 4 concurrent block write operations. When a write() fills a block, it waits until one of the 4 block-write slots is available before returning to the caller. So, expected memory use is
  • 4x 64 MiB blocks waiting to write to keepstore
  • 1x 64 MiB block waiting for a write-to-keepstore slot
  • 1x <64 MiB block at the end of each file (which could cause a problem when writing a large number of files if crunch-run's "flush after writing a large file" thing isn't working, but here we're seeing the crash after writing only 1 or 2 files)
Other components using memory include
  • arv-mount (keep_cache_ram was 268435456 and after reading a lot of data it seems safe to assume it's still got the maximum amount allocated). I don't think there's any situation where the fuse mount is accessed after the "copying output files" stage starts, and in that case crunch-run could reduce memory use by killing the arv-mount process earlier (in the present case there are individual files from a collection mounted inside the output dir, so the output-copying phase does require arv-mount)
  • keepstore itself (LocalKeepBlobBuffersPerVCPU=1 and VCPUs=1 so keepstore should only be using 1x 64 MiB)

All this adds up to ~11 blocks, 704 MiB, which doesn't sound like it should be a problem for a node with 2 GB RAM.

However, hoststat.txt shows memory use growing ~1 GB just as the container ends and crunch-run starts copying a 29 GB file, then growing ~100MB soon after it starts copying a 5 GB file.

hoststat   2022-09-15T19:34:44.561985895Z mem 713097216 cache 0 swap 3050 pgmajfault 637743104 rss
crunch-run 2022-09-15T19:34:46.079622438Z Container exited with status code 0
crunch-run 2022-09-15T19:34:46.124137478Z copying "/outputs/xxxxxxxx" (29123123123 bytes)
hoststat   2022-09-15T19:34:54.561348376Z mem 305278976 cache 0 swap 3391 pgmajfault 1475633152 rss
hoststat   2022-09-15T19:44:54.563159520Z mem 110309376 cache 0 swap 26601 pgmajfault 1678643200 rss
crunch-run 2022-09-15T19:44:58.123005311Z copying "/outputs/xxxxxxxx" (5123123123 bytes)
hoststat   2022-09-15T19:45:04.565925880Z mem 130220032 cache 0 swap 27024 pgmajfault 1678675968 rss
hoststat   2022-09-15T19:45:14.570719935Z mem 47415296 cache 0 swap 28308 pgmajfault 1780465664 rss
Actions #4

Updated by Tom Clegg 4 months ago

The failing container essentially does "gunzip *.gz" and the first two files in the logs quoted above are file1 and file1.gz. This means the ~1 GB increase in memory usage happens when crunch-run starts to copy the uncompressed file1 from local scratch space → native keep client → local keepstore → remove keep volume.

Actions #5

Updated by Tom Clegg 4 months ago

We haven't seen this on regular (non-fractional) VM types. It's still unclear which process is using all this extra memory. Most likely culprit is crunch-run itself, but could be keepstore, arv-mount, docker, or something else that gets triggered by a container finishing.

We should add hoststat.txt entries for the memory footprint of crunch-run, arv-mount, and keepstore.

This should provide good info for troubleshooting next time this (or another memory issue) comes up.

Actions #6

Updated by Peter Amstutz 4 months ago

  • Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Actions #7

Updated by Tom Clegg 3 months ago

19563-log-cr-mem @ e55bd0d3b54494061d54853c4b613ad680ffb6a9 -- developer-run-tests: #3335

adds a "procmem" line to crunchstat output.

Testing dev version on ce8i5:

2022-10-25T14:28:48.056830530Z crunch-run e55bd0d3b54494061d54853c4b613ad680ffb6a9-dev (go1.19.2) started
...
2022-10-25T14:28:51.941425414Z mem 731615232 cache 1485 pgmajfault 179535872 rss
2022-10-25T14:28:51.941494214Z procmem 10196 crunch-run 9701 keepstore
...
2022-10-25T14:29:01.943707056Z mem 1134686208 cache 1485 pgmajfault 539070464 rss
2022-10-25T14:29:01.943801556Z procmem 61176 arv-mount 10196 crunch-run 27555 keepstore
Actions #8

Updated by Tom Clegg 3 months ago

19563-log-cr-mem @ 8860ada0a6afd6adb2b26d5b0ab161bfe66c6019 -- developer-run-tests: #3339

Fixes units (the numbers in /proc/X/stat are # pages, typically 4 kiB)

2022-10-25T18:12:42.964863796Z crunch-run 8860ada0a6afd6adb2b26d5b0ab161bfe66c6019-dev (go1.19.2) started
...
2022-10-25T18:12:46.971549619Z mem 736526336 cache 1485 pgmajfault 180895744 rss
2022-10-25T18:12:46.991653214Z procmem 45219840 crunch-run 38080512 keepstore
...
2022-10-25T18:12:56.972254253Z mem 1126486016 cache 1485 pgmajfault 550330368 rss
2022-10-25T18:12:56.972366598Z procmem 259072000 arv-mount 45219840 crunch-run 112295936 keepstore

That looks more realistic.

Actions #9

Updated by Peter Amstutz 3 months ago

  • Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Actions #10

Updated by Peter Amstutz 3 months ago

Tom Clegg wrote in #note-8:

19563-log-cr-mem @ 8860ada0a6afd6adb2b26d5b0ab161bfe66c6019 -- developer-run-tests: #3339

Fixes units (the numbers in /proc/X/stat are # pages, typically 4 kiB)

[...]

That looks more realistic.

LGTM

Actions #11

Updated by Tom Clegg 3 months ago

  • Status changed from In Progress to Closed

The mystery isn't resolved, but this seems to be all we can do about it for now. Closing.

Actions #12

Updated by Peter Amstutz about 2 months ago

  • Release set to 47
Actions

Also available in: Atom PDF