https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422022-09-21T13:56:32ZArvadosArvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1065342022-09-21T13:56:32ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/106534/diff?detail_id=103112">diff</a>)</li></ul> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1067972022-09-28T15:40:44ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Tom Clegg</i></li></ul> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1068952022-10-03T15:39:23ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul>Collectionfs limits itself to 4 concurrent block write operations. When a write() fills a block, it waits until one of the 4 block-write slots is available before returning to the caller. So, expected memory use is
<ul>
<li>4x 64 MiB blocks waiting to write to keepstore</li>
<li>1x 64 MiB block waiting for a write-to-keepstore slot</li>
<li>1x <64 MiB block at the end of each file (which could cause a problem when writing a large number of files if crunch-run's "flush after writing a large file" thing isn't working, but here we're seeing the crash after writing only 1 or 2 files)</li>
</ul>
Other components using memory include
<ul>
<li>arv-mount (keep_cache_ram was 268435456 and after reading a lot of data it seems safe to assume it's still got the maximum amount allocated). <del>I don't think there's any situation where the fuse mount is accessed after the "copying output files" stage starts, and in that case crunch-run could reduce memory use by killing the arv-mount process earlier</del> (in the present case there are individual files from a collection mounted inside the output dir, so the output-copying phase does require arv-mount)</li>
<li>keepstore itself (LocalKeepBlobBuffersPerVCPU=1 and VCPUs=1 so keepstore should only be using 1x 64 MiB)</li>
</ul>
<p>All this adds up to ~11 blocks, 704 MiB, which doesn't sound like it should be a problem for a node with 2 GB RAM.</p>
<p>However, hoststat.txt shows memory use growing ~1 GB just as the container ends and crunch-run starts copying a 29 GB file, then growing ~100MB soon after it starts copying a 5 GB file.</p>
<pre>
hoststat 2022-09-15T19:34:44.561985895Z mem 713097216 cache 0 swap 3050 pgmajfault 637743104 rss
crunch-run 2022-09-15T19:34:46.079622438Z Container exited with status code 0
crunch-run 2022-09-15T19:34:46.124137478Z copying "/outputs/xxxxxxxx" (29123123123 bytes)
hoststat 2022-09-15T19:34:54.561348376Z mem 305278976 cache 0 swap 3391 pgmajfault 1475633152 rss
hoststat 2022-09-15T19:44:54.563159520Z mem 110309376 cache 0 swap 26601 pgmajfault 1678643200 rss
crunch-run 2022-09-15T19:44:58.123005311Z copying "/outputs/xxxxxxxx" (5123123123 bytes)
hoststat 2022-09-15T19:45:04.565925880Z mem 130220032 cache 0 swap 27024 pgmajfault 1678675968 rss
hoststat 2022-09-15T19:45:14.570719935Z mem 47415296 cache 0 swap 28308 pgmajfault 1780465664 rss
</pre> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1069032022-10-03T19:49:53ZTom Cleggtom@curii.com
<ul></ul><p>The failing container essentially does "gunzip *.gz" and the first two files in the logs quoted above are file1 and file1.gz. This means the ~1 GB increase in memory usage happens when crunch-run <em>starts</em> to copy the uncompressed file1 from local scratch space → native keep client → local keepstore → remove keep volume.</p> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1069422022-10-05T19:01:50ZTom Cleggtom@curii.com
<ul></ul><p>We haven't seen this on regular (non-fractional) VM types. It's still unclear which process is using all this extra memory. Most likely culprit is crunch-run itself, but could be keepstore, arv-mount, docker, or something else that gets triggered by a container finishing.</p>
<p>We should add hoststat.txt entries for the memory footprint of crunch-run, arv-mount, and keepstore.</p>
<p>This should provide good info for troubleshooting next time this (or another memory issue) comes up.</p> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1070752022-10-12T15:12:08ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>2022-10-12 sprint</i> to <i>2022-10-26 sprint</i></li></ul> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1072642022-10-24T19:17:16ZTom Cleggtom@curii.com
<ul></ul><p>19563-log-cr-mem @ <a class="changeset" title="19563: Clean up tests. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/e55bd0d3b54494061d54853c4b613ad680ffb6a9">e55bd0d3b54494061d54853c4b613ad680ffb6a9</a> -- <a class="external" href="https://ci.arvados.org/job/developer-run-tests/3335/"<a href="https://ci.arvados.org/job/developer-run-tests/3335/">developer-run-tests: #3335 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3335" alt="" /></a></a></p>
<p>adds a "procmem" line to crunchstat output.</p>
<p>Testing dev version on ce8i5:</p>
<pre>
2022-10-25T14:28:48.056830530Z crunch-run e55bd0d3b54494061d54853c4b613ad680ffb6a9-dev (go1.19.2) started
...
2022-10-25T14:28:51.941425414Z mem 731615232 cache 1485 pgmajfault 179535872 rss
2022-10-25T14:28:51.941494214Z procmem 10196 crunch-run 9701 keepstore
...
2022-10-25T14:29:01.943707056Z mem 1134686208 cache 1485 pgmajfault 539070464 rss
2022-10-25T14:29:01.943801556Z procmem 61176 arv-mount 10196 crunch-run 27555 keepstore
</pre> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1072842022-10-25T18:15:55ZTom Cleggtom@curii.com
<ul></ul><p>19563-log-cr-mem @ <a class="changeset" title="19563: Report process size in bytes, not pages. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@cu..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/8860ada0a6afd6adb2b26d5b0ab161bfe66c6019">8860ada0a6afd6adb2b26d5b0ab161bfe66c6019</a> -- <a class="external" href="https://ci.arvados.org/job/developer-run-tests/3339/"<a href="https://ci.arvados.org/job/developer-run-tests/3339/">developer-run-tests: #3339 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3339" alt="" /></a></a></p>
<p>Fixes units (the numbers in /proc/X/stat are # pages, typically 4 kiB)</p>
<pre>
2022-10-25T18:12:42.964863796Z crunch-run 8860ada0a6afd6adb2b26d5b0ab161bfe66c6019-dev (go1.19.2) started
...
2022-10-25T18:12:46.971549619Z mem 736526336 cache 1485 pgmajfault 180895744 rss
2022-10-25T18:12:46.991653214Z procmem 45219840 crunch-run 38080512 keepstore
...
2022-10-25T18:12:56.972254253Z mem 1126486016 cache 1485 pgmajfault 550330368 rss
2022-10-25T18:12:56.972366598Z procmem 259072000 arv-mount 45219840 crunch-run 112295936 keepstore
</pre>
<p>That looks more realistic.</p> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1073072022-10-26T15:16:57ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>2022-10-26 sprint</i> to <i>2022-11-09 sprint</i></li></ul> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1074342022-10-26T18:50:29ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote in <a href="#note-8">#note-8</a>:</p>
<blockquote>
<p>19563-log-cr-mem @ <a class="changeset" title="19563: Report process size in bytes, not pages. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@cu..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/8860ada0a6afd6adb2b26d5b0ab161bfe66c6019">8860ada0a6afd6adb2b26d5b0ab161bfe66c6019</a> -- <a class="external" href="https://ci.arvados.org/job/developer-run-tests/3339/"<a href="https://ci.arvados.org/job/developer-run-tests/3339/">developer-run-tests: #3339 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3339" alt="" /></a></a></p>
<p>Fixes units (the numbers in /proc/X/stat are # pages, typically 4 kiB)</p>
<p>[...]</p>
<p>That looks more realistic.</p>
</blockquote>
<p>LGTM</p> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1074552022-10-27T15:01:30ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Closed</i></li></ul><p>The mystery isn't resolved, but this seems to be all we can do about it for now. Closing.</p> Arvados - Bug #19563: crunch-run upload step possibly buffers too muchhttps://dev.arvados.org/issues/19563?journal_id=1090682022-12-13T15:32:23ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Release</strong> set to <i>47</i></li></ul>