Bug #13100
Closed
[crunch-run] Replace custom manifest-writing code with collectionFS
Description
After a lot of debugging to figure out why every job was being killed for exceeding memory limits, seemingly regardless of the limit specified, I finally caught crunch-run in the act of consuming a truly massive amount of RAM towards the end of a job. (In the end I set the container requirement to nearly all the RAM on the node, so I could see what was going on.)
Mid-run, crunch-run was using ~200MB, but then all of a sudden it started to allocate more memory at a rate of 850MB/s until, 30s later, it had consumed 25.5GB, at which point it levelled off and held steady for 230s, after which the job finished successfully. It kept the full 25.5GB allocated until it exited (or within a second of when it exited).
Looking at the container logs shows that the point in time when it started to allocate lots of RAM corresponds with the end of the container, beginning at the "Container exited with code: 0" line and proceeding to upload the output, which in this case was specified (in CWL) as a Directory.
The output collection (5ca01264c4721b24c9d36320a00027ce+328812) contains 4005 files totalling 8.9GiB - so crunch-run allocated enough memory to cache the full output collection in memory 2.8x over, which seems somewhat excessive.
Updated by Joshua Randall almost 7 years ago
This may be related to https://dev.arvados.org/issues/11583 which points out an apparent memory leak in crunch-run while running tests, but it may be a separate issue related specifically to the end-of-job upload phase.
During the container run, crunch-run memory started around 65MB and climbed steadily throughout the job up to the 200MB point (that pattern seems more like the behaviour described in 11583), but when the upload phase begins it suddenly allocates a much larger amount of memory, which may be a separate issue. I'll try to see if I can come up with a test that demonstrates this issue.
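A test demonstrating the spike could measure live-heap growth across the upload phase. The sketch below shows one way to do that with `runtime.MemStats`; `heapAllocDelta` is a hypothetical helper written for this note, not part of crunch-run.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAllocDelta reports how much live heap grew while f ran. A forced GC
// first settles the heap so the baseline is (mostly) live data.
func heapAllocDelta(f func()) uint64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	if after.HeapAlloc < before.HeapAlloc {
		return 0
	}
	return after.HeapAlloc - before.HeapAlloc
}

func main() {
	var retained []byte
	delta := heapAllocDelta(func() {
		retained = make([]byte, 1<<20) // simulate buffering 1 MiB of output
	})
	fmt.Println(delta >= 1<<20, len(retained))
}
```

A real test would wrap the upload phase in `f` and assert the delta stays well below the size of the output collection.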
Updated by Thomas Hickman almost 7 years ago
The above report had outputDirType: local_output_dir. We tried this again with outputDirType: keep_output_dir and memory still gradually grew to 181 MB, then jumped to 768 MB in the last two seconds (this is still a problem, but not as bad, and shorter lived). We also tried loadListing: no_listing, which grew to 182 MB then jumped to 707 MB in the last two seconds.
Updated by Tom Clegg almost 7 years ago
- Related to Bug #11583: [crunch-run] Fix excessive memory use added
Updated by Tom Clegg almost 7 years ago
- Related to Idea #13048: Refactor crunch2 logging added
Updated by Tom Clegg almost 7 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
- 13100-crunch-run-memory @ bc27c169f78471a21f6f5c35a879e4b57afa0e22
- 13100-crunch-run-memory @ 304a807e24363f3c38949b2da10c500f338d08b0
Together, these reduce test mem usage by more than half, to ~3.2GB.
With all collection-writing disabled, test suite ends with rss ~300M, so I'm guessing there's another ~3GB worth of similar problems.
Updated by Peter Amstutz almost 7 years ago
Tom Clegg wrote:
Reduce test mem by ~2 GB by setting stream.Block to nil after sending it to uploader in (*CollectionWriter)Finish()
- 13100-crunch-run-memory @ bc27c169f78471a21f6f5c35a879e4b57afa0e22
That is 64 MiB per file so, good catch!
Reduce test mem by ~2.6 GB by releasing the reference to the last block written in each test case (KeepTestClient.Content)
- 13100-crunch-run-memory @ 304a807e24363f3c38949b2da10c500f338d08b0
Together, these reduce test mem usage by more than half, to ~3.2GB.
With all collection-writing disabled, test suite ends with rss ~300M, so I'm guessing there's another ~3GB worth of similar problems.
This fix LGTM, but it would be nice to find the root cause of why the memory lingers; I agree it is likely to be an unterminated goroutine.
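The first fix above follows a common Go pattern: nil out a reference as soon as ownership passes to another component, so the GC can reclaim the buffer. A simplified sketch (the `filestream` type and `finish` function here are illustrative stand-ins, not the real CollectionWriter code):

```go
package main

import "fmt"

// filestream stands in for a per-file write buffer holding one Keep block
// (up to 64 MiB each in real crunch-run).
type filestream struct {
	Name  string
	Block []byte
}

// finish hands each buffered block to the uploader channel and then drops
// the reference. Without `s.Block = nil`, every buffer stays reachable
// until the whole upload finishes, so memory scales with the number of
// files rather than the number of in-flight blocks.
func finish(streams []*filestream, uploads chan<- []byte) {
	for _, s := range streams {
		uploads <- s.Block
		s.Block = nil // release the buffer as soon as the uploader owns it
	}
}

func main() {
	streams := []*filestream{{Name: "out.txt", Block: make([]byte, 4)}}
	uploads := make(chan []byte, 1)
	finish(streams, uploads)
	fmt.Println(len(<-uploads), streams[0].Block == nil)
}
```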
Updated by Tom Morris almost 7 years ago
- Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint
- Story points set to 2.0
Updated by Tom Clegg almost 7 years ago
- Subject changed from crunch-run consumes a MASSIVE amount of ram at the end of a job after the container exits to [crunch-run] Replace custom manifest-writing code with collectionFS
Updated by Tom Clegg almost 7 years ago
The hard part turned out to be refactoring the "copy output" stuff, disentangling the symlink/mount traversal logic from the copying of individual files. This also made it feasible to test various mount scenarios without resorting to comparing literal manifest text.
Test suite's memory use looks much better, as hoped.
Updated by Peter Amstutz almost 7 years ago
- This description on "copier" is slightly misleading, since ContainerRunner() only uses the high-level Copy() method. The comment should probably go on the Copy() method.
// To use a copier: first call walkMount() to inspect the output
// directory, grab the necessary parts of already-stored collections,
// and prepare a list of files that need to be copied from the local
// filesystem; then call commit() to copy the file data and return a
// complete output manifest.
- Since you refactored/rewrote the symlink traversal code, can you see if https://dev.arvados.org/issues/12606 is fixed?
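The two-phase walkMount()/commit() usage quoted above can be sketched roughly as follows. All types and internals here are illustrative stand-ins, not the actual crunch-run copier:

```go
package main

import "fmt"

// outputEntry describes one path in the output directory.
type outputEntry struct {
	path   string
	inKeep bool // already stored in Keep via a mount
}

type copier struct {
	manifestParts []string // grabbed from already-stored collections
	staged        []string // local files still needing a copy
}

// walkMount inspects the output directory: entries already stored in Keep
// are referenced directly; everything else is queued for copying.
func (c *copier) walkMount(entries []outputEntry) {
	for _, e := range entries {
		if e.inKeep {
			c.manifestParts = append(c.manifestParts, e.path)
		} else {
			c.staged = append(c.staged, e.path)
		}
	}
}

// commit copies the queued file data (stubbed here) and returns the
// complete list of output paths, standing in for the output manifest.
func (c *copier) commit() []string {
	for _, path := range c.staged {
		c.manifestParts = append(c.manifestParts, path) // copy data, then record
	}
	return c.manifestParts
}

func main() {
	c := &copier{}
	c.walkMount([]outputEntry{
		{path: "mounted/a.txt", inKeep: true},
		{path: "local/b.txt", inKeep: false},
	})
	fmt.Println(c.commit())
}
```

Separating the planning pass from the copying pass is what makes the mount scenarios testable without comparing literal manifest text.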
Updated by Peter Amstutz almost 7 years ago
I ran the CWL test suite in arvbox using the latest crunch-run, this tests all kinds of weird crunch-run uploader cases. I got some errors:
Test [88/128] Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmp6N9Bhe --quiet v1.0/stagefile.cwl v1.0/stagefile-job.yml
Test writable staged files. Returned non-zero
2018-04-06 16:15:16 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:15:16 cwltool WARNING: Final process status is permanentFail
Test [89/128] Test [90/128] Test [91/128] Test [92/128] Test [93/128] Test [94/128] Test [95/128] Test [96/128] Test [97/128] Test [98/128] Test [99/128] Test [100/128] Test [101/128] Test [102/128] Test [103/128] Test [104/128] Test [105/128]
Test [106/128] Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmpxCQ1oW --quiet v1.0/recursive-input-directory.cwl v1.0/recursive-input-directory.yml
Test if a writable input directory is recursivly copied and writable Returned non-zero
2018-04-06 16:21:46 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:21:46 cwltool WARNING: Final process status is permanentFail
Test [107/128] Test [108/128] Test [109/128] Test [110/128] Test [111/128] Test [112/128] Test [113/128] Test [114/128] Test [115/128]
Test [116/128] Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmpKfAbFr --quiet v1.0/writable-dir.cwl v1.0/empty.json
Test empty writable dir with InitialWorkDirRequirement Returned non-zero
2018-04-06 16:26:03 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:26:03 cwltool WARNING: Final process status is permanentFail
Test [117/128] Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmprRlgXg --quiet v1.0/writable-dir-docker.cwl v1.0/empty.json
Test empty writable dir with InitialWorkDirRequirement inside Docker Returned non-zero
2018-04-06 16:26:24 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:26:24 cwltool WARNING: Final process status is permanentFail
Test [118/128] Test [119/128] Test [120/128]
Test [121/128] Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmp_aCW9A --quiet v1.0/iwdr_with_nested_dirs.cwl v1.0/empty.json
Test InitialWorkDirRequirement with a nested directory structure from another step Returned non-zero
2018-04-06 16:28:08 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:28:08 cwltool WARNING: Final process status is permanentFail
I'll have to track down exactly what failed in these cases.
Updated by Tom Clegg almost 7 years ago
"Test if a writable input directory is recursivly copied and writable" ... I'm guessing this is "writable collection mounted below output path" which currently fails ("cannot ...: writable collection mounted at ...") but it should be supported -- the files will have been copied into the output dir/collection rather than mounted so we just need to skip the mount treatment.
Updated by Peter Amstutz almost 7 years ago
Tom Clegg wrote:
"Test if a writable input directory is recursivly copied and writable" ... I'm guessing this is "writable collection mounted below output path" which currently fails ("cannot ...: writable collection mounted at ...") but it should be supported -- the files will have been copied into the output dir/collection rather than mounted so we just need to skip the mount treatment.
Ok, rerunning tests.
Updated by Tom Clegg almost 7 years ago
- handles writable collections mounted below output dir
- handles symlinks to files/dirs in writable collections mounted elsewhere
- adds test case for relative symlink to relative symlink (which seems to be the theory about #12606, although I haven't confirmed that scenario was broken before)
Updated by Tom Clegg almost 7 years ago
- Related to Bug #12606: Symlink in output points to invalid location -- no such file or directory added
Updated by Peter Amstutz almost 7 years ago
I've now run a bunch of CWL tests on 13100-crunch-run-output and they all seem to be passing, so LGTM.
Updated by Tom Clegg almost 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|95be914af0ab0a82c4fa92b3f9c29ebec88e8595.