Bug #21891
Very slow copying from collection into output in crunch-run copy phase (Closed)
Description
Reports of very slow copying of files from collections during the output phase (~15 seconds per file).
Example log:
2024-06-11T20:59:18.902328206Z copying "PT001379858_A.dedup.flagstat" from d75eba3924f3985128d7a2dbb7f59cf6+1786208/PT001379858_A.dedup.flagstat
2024-06-11T20:59:35.158068026Z copying "PT003617020_A.insertsizemetricalgo.metrics.txt" from d75eba3924f3985128d7a2dbb7f59cf6+1786208/PT003617020_A.insertsizemetricalgo.metrics.txt
2024-06-11T20:59:50.741935364Z copying "PT005899462_A.insertsizemetricalgo.metrics.txt" from d75eba3924f3985128d7a2dbb7f59cf6+1786208/PT005899462_A.insertsizemetricalgo.metrics.txt
We've had past reports of performance problems in the copying phase when it copies from collections, but I assumed those were slow because the collections were being loaded from the API server. In this case, the files all come from the same manifest, so the copier should be using manifestCache (unless that's not working for some reason, but monitoring doesn't indicate substantial network traffic either, so it doesn't appear to be spending most of its time fetching data).
Maybe the problem is actually the efficiency of the Extract() method?
Also, appending to a string repeatedly the way it is done here is a notorious anti-pattern: Go strings are immutable, so each append allocates a new buffer and copies the entire existing string, and that copying tends to dominate runtime as the string grows.
cp.logger.Printf("copying %q from %v/%v", outputRelPath, srcMount.PortableDataHash, strings.TrimPrefix(srcRelPath, "./"))
mft, err := cp.getManifest(srcMount.PortableDataHash)
if err != nil {
	return err
}
cp.manifest += mft.Extract(srcRelPath, dest).Text
Files
Updated by Tom Clegg 6 months ago
- File flame-a60b48f5c1.png flame-a60b48f5c1.png added
- File flame-bb806a2834.png flame-bb806a2834.png added
Extract is indeed quite slow for large manifests. On a manifest with 40K files x 32 segments each, extracting a single file takes ~10s, and most of that time is spent on this line:
if s.StreamName+"/"+name != target {
Before & after moving that comparison out of the loop:
21891-output-copying-performance @ bb806a2834d0a0320f767a536cb1c9854dad52c1 -- developer-run-tests: #4305
Meanwhile, collectionfs in sdk/go/arvados loads such a manifest in 0.7 seconds.
Given that, and the usage pattern here (extracting multiple files/dirs from one large manifest), it probably makes more sense for lib/crunchrun.copier to use arvados.Snapshot() and arvados.Splice(), and cache the collectionfs instead of re-parsing the manifest each time.
This appears to be the only place we use the Extract method.
There is only one other place we use the sdk/go/manifest package at all: lib/deduplicationreport uses it to build a map of {block hash -> block size}. I think we should remove the package entirely.
Updated by Tom Clegg 6 months ago
Another thing: in the "collection is writable" case (not the one exercised in this bug report), it looks like crunchrun.copier reads the entire .arvados#collection file from arv-mount for each file being copied, and arv-mount invalidates that file on each lookup, which forces it to redo the JSON encoding as well. Caching hostdir -> manifest would skip a lot of this FUSE traffic.
Updated by Tom Clegg 6 months ago
- Status changed from New to In Progress
- File flame-4f883aa130.png flame-4f883aa130.png added
- File flame-acc75dd485.png flame-acc75dd485.png added
- File flame-2914826a9a.png flame-2914826a9a.png added
Added two test cases to crunch-run (one read-only, one writable, because those are different code paths). This time I used 10K files instead of the 40K used in the test above, because 40K files took >10 minutes.
Unpatched Extract (including the fix from #note-6 above, though it turns out that doesn't actually help when extracting the entire tree): ~200s CPU, 114s clock time
Patching Extract with another string-concat fix (using bytes.Buffer instead of += to build a manifest): 10.2s CPU, 7.4s clock time
Replacing Extract with collectionfs Snapshot/Splice: 3.2s CPU, 1.3s clock time
21891-output-copying-performance @ 10cdbe63e7b41ee773261c904e2913d2b7bad034 -- developer-run-tests: #4307
Updated by Tom Clegg 6 months ago
...and removed sdk/go/manifest, now that its only remaining user was deduplicationreport, which needed it only to parse out the block locators.
21891-output-copying-performance @ 15a36415708345e2361741606cc0e06bbd9e40c6 -- developer-run-tests: #4312
- All agreed upon points are implemented / addressed.
- ✅ sped up copying by a lot
- ✅ removed a bunch of now-unused code
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
- ✅ see timing results above
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- ✅ crunch-run copier should now handle large collections just as well as webdav, etc.
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- ✅
Updated by Brett Smith 6 months ago
Tom Clegg wrote in #note-9:
21891-output-copying-performance @ 15a36415708345e2361741606cc0e06bbd9e40c6 -- developer-run-tests: #4312
LGTM, thanks.
Updated by Brett Smith 6 months ago
Actually, sorry, I did have just one small understanding question. In this part of copier_test.go:
for i := 0; i < 100; i += 13 {
for j := 0; j < 100; j += 17 {
fnm := fmt.Sprintf("/fakecollection/dir%d/dir%d/file%d", i, j, j)
_, err := s.cp.staged.Stat(fnm)
c.Assert(err, check.IsNil, check.Commentf("%s", fnm))
}
}
Do 13 and 17 just represent arbitrary sampling strategies? The comment could maybe be a little clearer on their (non-)significance.
Updated by Tom Clegg 6 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|4950ee8518e1e69a972c2ea40419d70284c1ce3d.
Updated by Brett Smith 2 months ago
- Related to Bug #22226: crunch-run segfault+panic when output directory includes a file mounted from a collection added