Project

General

Profile

Actions

Bug #21891

closed

Very slow copying from collection into output in crunch-run copy phase

Added by Peter Amstutz 20 days ago. Updated 6 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Story points:
-
Release:
Release relationship:
Auto

Description

Reported very slow copying files from collections to collections in output phase (~15s per file).

Example log:

2024-06-11T20:59:18.902328206Z copying "PT001379858_A.dedup.flagstat" from d75eba3924f3985128d7a2dbb7f59cf6+1786208/PT001379858_A.dedup.flagstat
2024-06-11T20:59:35.158068026Z copying "PT003617020_A.insertsizemetricalgo.metrics.txt" from d75eba3924f3985128d7a2dbb7f59cf6+1786208/PT003617020_A.insertsizemetricalgo.metrics.txt
2024-06-11T20:59:50.741935364Z copying "PT005899462_A.insertsizemetricalgo.metrics.txt" from d75eba3924f3985128d7a2dbb7f59cf6+1786208/PT005899462_A.insertsizemetricalgo.metrics.txt

We've had other reports of performance issues with the copying phase where it is copying from collections in the past, but I assumed it was because it was loading collections from the API server, and that was slow. In this case, these are from the same manifest, so it should be using manifestCache (unless that's not working, for some reason, but monitoring also doesn't indicate substantial network traffic, so it doesn't appear to be spending most of its time fetching stuff).

Maybe the problem is actually the efficiency of the Extract() method?

Also, appending strings repeatedly the way it is being done here is a notorious anti-pattern -- unless Go strings have some special buffer behavior, string appends like this require allocating a new buffer and copying the whole string in order to append text to the end, which tends to dominate runtime as the string gets larger.

        cp.logger.Printf("copying %q from %v/%v", outputRelPath, srcMount.PortableDataHash, strings.TrimPrefix(srcRelPath, "./"))
        mft, err := cp.getManifest(srcMount.PortableDataHash)
        if err != nil {
            return err
        }
        cp.manifest += mft.Extract(srcRelPath, dest).Text

Files

flame-a60b48f5c1.png (51.7 KB) flame-a60b48f5c1.png Tom Clegg, 06/21/2024 12:46 PM
flame-bb806a2834.png (98.2 KB) flame-bb806a2834.png Tom Clegg, 06/21/2024 12:46 PM
flame-2914826a9a.png (143 KB) flame-2914826a9a.png Tom Clegg, 06/23/2024 01:14 PM
flame-acc75dd485.png (124 KB) flame-acc75dd485.png Tom Clegg, 06/23/2024 01:14 PM
flame-4f883aa130.png (149 KB) flame-4f883aa130.png Tom Clegg, 06/23/2024 01:35 PM

Subtasks 1 (0 open1 closed)

Task #21914: Review 21891-output-copying-performanceResolvedTom Clegg06/25/2024Actions
Actions #1

Updated by Peter Amstutz 20 days ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 20 days ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 14 days ago

  • Release set to 70
Actions #4

Updated by Peter Amstutz 14 days ago

  • Category changed from Crunch to Keep
Actions #5

Updated by Peter Amstutz 13 days ago

  • Assigned To set to Tom Clegg
Actions #6

Updated by Tom Clegg 11 days ago

Extract is indeed quite slow for large manifests. On a manifest with 40K files x 32 segments, extracting a single file takes ~10s, and most of that time is spent on this line.

               if s.StreamName+"/"+name != target {

Before & after moving that comparison out of the loop:

21891-output-copying-performance @ bb806a2834d0a0320f767a536cb1c9854dad52c1 -- developer-run-tests: #4305

Meanwhile, collectionfs in sdk/go/arvados loads such a manifest in 0.7 seconds.

Given that, and the usage pattern here (extracting multiple files/dirs from one large manifest), it probably makes more sense for lib/crunchrun.copier to use arvados.Snapshot() and arvados.Splice(), and cache the collectionfs instead of re-parsing the manifest each time.

This appears to be the only place we use the Extract method.

There is one other place we use the sdk/go/manifest package for anything: lib/deduplicationreport uses it to build a map of {block hash -> block size}. I think we should remove it.

Actions #7

Updated by Tom Clegg 11 days ago

Another thing: in the "collection is writable" case (not the one being exercised in this bug report), it looks like crunchrun.copier reads the entire .arvados#collection file from arv-mount for each file being copied, and arv-mount invalidates that file on each lookup, which causes it to redo the JSON encoding as well. Caching hostdir->manifest would skip a lot of this fuse traffic.

Actions #8

Updated by Tom Clegg 9 days ago

Added two test cases to crunch-run (one readonly, one not, because those are different code paths). This time, used 10K files instead of 40K files like the above test, because 40K files took >10 minutes.

Unpatched Extract (including the fix from #note-6 above, but it turns out that doesn't actually help when extracting the entire tree): ~200s cpu, 114s clock time

Patching Extract with another string-concat fix (using bytes.Buffer instead of += to build a manifest): 10.2s cpu, 7.4s clock time

Replacing Extract with collectionfs Snapshot/Splice: 3.2s cpu, 1.3s clock time

21891-output-copying-performance @ 10cdbe63e7b41ee773261c904e2913d2b7bad034 -- developer-run-tests: #4307

Actions #9

Updated by Tom Clegg 8 days ago

...and removed sdk/go/manifest now that the only remaining usage was deduplicationreport, and only to parse out the block locators.

21891-output-copying-performance @ 15a36415708345e2361741606cc0e06bbd9e40c6 -- developer-run-tests: #4312

  • All agreed upon points are implemented / addressed.
    • ✅ sped up copying by a lot
    • ✅ removed a bunch of now-unused code
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • ✅ see timing results above
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • ✅ crunch-run copier should now handle large collections just as well as webdav, etc.
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.
Actions #10

Updated by Brett Smith 7 days ago

Tom Clegg wrote in #note-9:

21891-output-copying-performance @ 15a36415708345e2361741606cc0e06bbd9e40c6 -- developer-run-tests: #4312

LGTM, thanks.

Actions #11

Updated by Brett Smith 7 days ago

Actually sorry, I did have just one small understanding question. In this part of copier_test.go:

for i := 0; i < 100; i += 13 {
    for j := 0; j < 100; j += 17 {
        fnm := fmt.Sprintf("/fakecollection/dir%d/dir%d/file%d", i, j, j)
        _, err := s.cp.staged.Stat(fnm)
        c.Assert(err, check.IsNil, check.Commentf("%s", fnm))
    }
}

Do 13 and 17 just represent arbitrary sampling strategies? The comment could maybe be a little clearer on their (non-)significance.

Actions #12

Updated by Tom Clegg 7 days ago

Do 13 and 17 just represent arbitrary sampling strategies? The comment could maybe be a little clearer on their (non-)significance.

Yes. Added to comment.

Actions #13

Updated by Tom Clegg 7 days ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF