Bug #13100

[crunch-run] Replace custom manifest-writing code with collectionFS

Added by Joshua Randall about 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
03/15/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0
Release:
Release relationship:
Auto

Description

After a lot of debugging trying to figure out why every job was being killed for exceeding memory limits seemingly regardless of the limit specified, I finally caught crunch-run in the act of consuming a truly massive amount of RAM towards the end of a job (in the end I just set container requirement for nearly all the RAM on the node, so I could see what was going on).

Mid run, crunch-run was using ~200MB, but then all of a sudden it started to allocate more memory at a rate of 850MB/s until 30s later when it had consumed 25.5GB, at which point it levelled off and held steady for 230s after which the job finished successfully. It had the full 25.5GB allocated until it exited (or within a second of when it exited).

Looking at the container logs shows that the point in time when it started to allocate lots of RAM corresponds with the end of the container, beginning at the "Container exited with code: 0" line and proceeding to upload the output, which in this case was specified (in CWL) as a Directory.

The output collection (5ca01264c4721b24c9d36320a00027ce+328812) contains 4005 files totalling 8.9GiB - so crunch-run allocated enough memory to cache the full output collection in memory 2.8x over, which seems somewhat excessive.


Subtasks

Task #13233: Review 13100-crunch-run-memoryResolvedPeter Amstutz

Task #13294: Review 13100-crunch-run-outputResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #11583: [crunch-run] Fix excessive memory useResolved

Related to Arvados - Story #13048: Refactor crunch2 loggingNew

Related to Arvados - Bug #12606: Symlink in output points to invalid location -- no such file or directoryResolved11/17/2017

Associated revisions

Revision 05bf879d
Added by Tom Clegg almost 2 years ago

Merge branch '13100-crunch-run-memory'

refs #13100
refs #11583

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 95be914a
Added by Tom Clegg almost 2 years ago

Merge branch '13100-crunch-run-output'

fixes #13100
fixes #11583
fixes #12606
refs #13048

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Joshua Randall about 2 years ago

This may be related to https://dev.arvados.org/issues/11583 which points out an apparent memory leak in crunch-run while running tests, but it may be a separate issue related specifically to the end-of-job upload phase.

During the container run, crunch-run memory started around 65MB and climbed steadily throughout the job up to the 200MB point (that pattern seems more like the behaviour described in 11583), but when the upload phase begins it suddenly allocates a much larger amount of memory, which may be a separate issue. I'll try to see if I can come up with a test that demonstrates this issue.

#2 Updated by Thomas Hickman about 2 years ago

The above report had

outputDirType: local_output_dir
. We tried this again with
outputDirType: keep_output_dir
and still gradually grew to 181 MB then jumped to 768 MB in the last two seconds (this is still a problem, but not as bad and shorter lived. We also tried
loadListing: no_listing
which grew to 182 MB then jumped to 707MB in the last two seconds.

#3 Updated by Tom Clegg almost 2 years ago

  • Related to Bug #11583: [crunch-run] Fix excessive memory use added

#4 Updated by Tom Clegg almost 2 years ago

#5 Updated by Tom Clegg almost 2 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
Reduce test mem by ~2 GB by setting stream.Block to nil after sending it to uploader in (*CollectionWriter)Finish() Reduce test mem by ~2.6 GB by releasing the reference to the last block written in each test case (KeepTestClient.Content)

Together, these reduce test mem usage by more than half, to ~3.2GB.

With all collection-writing disabled, test suite ends with rss ~300M, so I'm guessing there's another ~3GB worth of similar problems.

#6 Updated by Tom Clegg almost 2 years ago

  • Target version set to 2018-03-28 Sprint

#7 Updated by Peter Amstutz almost 2 years ago

Tom Clegg wrote:

Reduce test mem by ~2 GB by setting stream.Block to nil after sending it to uploader in (*CollectionWriter)Finish()

That is 64 MiB per file so, good catch!

Reduce test mem by ~2.6 GB by releasing the reference to the last block written in each test case (KeepTestClient.Content)

Together, these reduce test mem usage by more than half, to ~3.2GB.

With all collection-writing disabled, test suite ends with rss ~300M, so I'm guessing there's another ~3GB worth of similar problems.

This fix LGTM but it would nice to find the root cause of why lingers, I agree it is likely to be an unterminated goroutine.

#8 Updated by Tom Morris almost 2 years ago

  • Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint
  • Story points set to 2.0

#9 Updated by Tom Clegg almost 2 years ago

  • Subject changed from crunch-run consumes a MASSIVE amount of ram at the end of a job after the container exits to [crunch-run] Replace custom manifest-writing code with collectionFS

#10 Updated by Tom Clegg almost 2 years ago

The hard part turned out to be refactoring the "copy output" stuff, disentangling the symlink/mount traversal logic from the copying of individual files. This also made it feasible to test various mount scenarios without resorting to comparing literal manifest text.

Test suite's memory use looks much better, as hoped:
  • before (master @ c4bd314ff): 2645M rss, 31.153s
  • after (13100-crunch-run-output @ 826bc4316): 349M rss, 21.368s

#11 Updated by Peter Amstutz almost 2 years ago

  • This description on "copier" is slightly misleading, since ContainerRunner() only uses the high level Copy() method. The comment should probably go on the Copy() method.
    // To use a copier: first call walkMount() to inspect the output
    // directory, grab the necessary parts of already-stored collections,
    // and prepare a list of files that need to be copied from the local
    // filesystem; then call commit() to copy the file data and return a
    // complete output manifest.
    

#12 Updated by Peter Amstutz almost 2 years ago

I ran the CWL test suite in arvbox using the latest crunch-run, this tests all kinds of weird crunch-run uploader cases. I got some errors:

Test [88/128] 
Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmp6N9Bhe --quiet v1.0/stagefile.cwl v1.0/stagefile-job.yml
Test writable staged files.
Returned non-zero
2018-04-06 16:15:16 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:15:16 cwltool WARNING: Final process status is permanentFail

Test [89/128] 
Test [90/128] 
Test [91/128] 
Test [92/128] 
Test [93/128] 
Test [94/128] 
Test [95/128] 
Test [96/128] 
Test [97/128] 
Test [98/128] 
Test [99/128] 
Test [100/128] 
Test [101/128] 
Test [102/128] 
Test [103/128] 
Test [104/128] 
Test [105/128] 
Test [106/128] 
Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmpxCQ1oW --quiet v1.0/recursive-input-directory.cwl v1.0/recursive-input-directory.yml
Test if a writable input directory is recursivly copied and writable
Returned non-zero
2018-04-06 16:21:46 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:21:46 cwltool WARNING: Final process status is permanentFail

Test [107/128] 
Test [108/128] 
Test [109/128] 
Test [110/128] 
Test [111/128] 
Test [112/128] 
Test [113/128] 
Test [114/128] 
Test [115/128] 
Test [116/128] 
Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmpKfAbFr --quiet v1.0/writable-dir.cwl v1.0/empty.json
Test empty writable dir with InitialWorkDirRequirement
Returned non-zero
2018-04-06 16:26:03 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:26:03 cwltool WARNING: Final process status is permanentFail

Test [117/128] 
Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmprRlgXg --quiet v1.0/writable-dir-docker.cwl v1.0/empty.json
Test empty writable dir with InitialWorkDirRequirement inside Docker
Returned non-zero
2018-04-06 16:26:24 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:26:24 cwltool WARNING: Final process status is permanentFail

Test [118/128] 
Test [119/128] 
Test [120/128] 
Test [121/128] 
Test failed: /home/peter/work/scripts/venv/bin/arvados-cwl-runner --api=containers --compute-checksum --disable-reuse --outdir=/tmp/tmp_aCW9A --quiet v1.0/iwdr_with_nested_dirs.cwl v1.0/empty.json
Test InitialWorkDirRequirement with a nested directory structure from another step
Returned non-zero
2018-04-06 16:28:08 arvados.cwl-runner WARNING: Overall process status is permanentFail
2018-04-06 16:28:08 cwltool WARNING: Final process status is permanentFail

I'll have to track down exactly what failed in these cases.

#13 Updated by Tom Clegg almost 2 years ago

"Test if a writable input directory is recursivly copied and writable" ... I'm guessing this is "writable collection mounted below output path" which currently fails ("cannot ...: writable collection mounted at ...") but it should be supported -- the files will have been copied into the output dir/collection rather than mounted so we just need to skip the mount treatment.

#14 Updated by Peter Amstutz almost 2 years ago

Tom Clegg wrote:

"Test if a writable input directory is recursivly copied and writable" ... I'm guessing this is "writable collection mounted below output path" which currently fails ("cannot ...: writable collection mounted at ...") but it should be supported -- the files will have been copied into the output dir/collection rather than mounted so we just need to skip the mount treatment.

Ok, rerunning tests.

#15 Updated by Tom Clegg almost 2 years ago

13100-crunch-run-output @ 4e7de9eae07a34115cbebb074880f10b74c077d6
  • handles writable collections mounted below output dir
  • handles symlinks to files/dirs in writable collections mounted elsewhere
  • adds test case for relative symlink to relative symlink (which seems to be the theory about #12606, although I haven't confirmed that scenario was broken before)

#16 Updated by Tom Clegg almost 2 years ago

  • Related to Bug #12606: Symlink in output points to invalid location -- no such file or directory added

#17 Updated by Peter Amstutz almost 2 years ago

I've now run a bunch of CWL tests on 13100-crunch-run-output and they all seem to be passing, so LGTM.

#18 Updated by Tom Clegg almost 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

#19 Updated by Tom Morris over 1 year ago

  • Release set to 13

Also available in: Atom PDF