Bug #18051
Status: Closed
Investigate backend performance for WebDAV requests on very large collections
Description
In working on #17585 there are some questions as to whether the backend is slowing down the frontend.
Updated by Peter Amstutz about 3 years ago
- Status changed from New to In Progress
Updated by Tom Clegg about 3 years ago
It looks like RailsAPI is doing some CPU-intensive work after our code sends a collections#get response. Using wrong JSON encoder? Computing an Etag/hash? TBD.
It looks like keep-web is not effectively caching a large collection even when it's needed for consecutive requests. Default cache size too small? Depends on code path ("home" vs read-write)? TBD.
Meanwhile I found a few optimizations to speed up the manifest-to-collectionfs process by ~20%, using a benchmark with 200K blocks and 4M files:
- bytes.Split instead of strings.Split
- custom token-splitting func optimized for a 1-byte separator and reusing the same destination slice
- one reused slice for the file to be added (reduces allocs and repetitive calls to Split())
- skip manifestUnescape and Split on filenames that don't contain an escape char or dir separator
- don't copy modtimes from parents while building tree (which involves lock+unlock), just wait for backdateTree to fill in everything
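The second bullet above (a token splitter specialized for a 1-byte separator that reuses its destination slice) can be sketched roughly as follows. This is a minimal illustration, not the actual collectionfs code; the function name splitToSlice is hypothetical.

```go
package main

import "fmt"

// splitToSlice splits buf on a single-byte separator, appending the
// resulting tokens to dst[:0] so the same backing array is reused
// across calls instead of allocating a fresh slice per line, as
// bytes.Split would.
func splitToSlice(dst [][]byte, buf []byte, sep byte) [][]byte {
	dst = dst[:0]
	start := 0
	for i := 0; i < len(buf); i++ {
		if buf[i] == sep {
			dst = append(dst, buf[start:i])
			start = i + 1
		}
	}
	return append(dst, buf[start:])
}

func main() {
	var tokens [][]byte
	for _, line := range [][]byte{[]byte("a b c"), []byte("d e")} {
		// tokens is passed back in, so its capacity grows once and
		// is then reused for every subsequent line.
		tokens = splitToSlice(tokens, line, ' ')
		fmt.Println(len(tokens))
	}
}
```

The returned tokens are subslices of buf, so this is only safe while the caller keeps buf alive, which holds during a single-pass manifest parse.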
benchmark time before optimizations: 10.7, 11.0, 10.8
benchmark time after optimizations: 8.7, 8.6, 8.4
18051-collectionfs @ 34239884a3eabd19ce02445c5582ba1102bbf4e8 -- developer-run-tests: #2685
Updated by Tom Clegg about 3 years ago
- Target version changed from 2021-09-15 sprint to 2021-09-29 sprint
Updated by Tom Clegg about 3 years ago
In #12216 we introduced a UUIDTTL config (default 5s) to limit "collection state" cache time, so you could have a 5-minute cache TTL without having to wait 5 minutes between writing via arv-put/fuse and reading back via WebDAV. But we did this by expiring cached uuid-to-pdh lookup results quickly, which inadvertently made the entire cache TTL effectively 5s (if the uuid-to-pdh lookup result isn't cached, we fetch the entire collection record).
What we really needed was a "revalidate" time on the uuid-to-pdh cache, which also happens to make the permission cache redundant.
Changes here:
- use the UUIDTTL config as a "revalidate" time (we might also want to consider renaming it)
- remove permission cache
- serialize cache sweeps (previous code was prone to starting multiple concurrent sweeps, which is pointless)
Testing on ce8i5, a sequence of GET requests on a single large collection now takes 7s, 3s, 3s, 3s, 3s, ... instead of 7s, 7s, 7s, 7s, 7s, ... but note this also requires #18122, otherwise the "get current pdh for collection" shortcut retrieves the entire manifest from RailsAPI so it's not really a shortcut.
time curl --user none:$ARVADOS_API_TOKEN https://ce8i5-4zz18-3asw0byh6ta6y6p.collections.ce8i5.arvadosapi.com/foobar
18051-webdav-cache @ 02b35f7480e2792377e2ed23f740fff4b53badb9 -- developer-run-tests: #2690
Updated by Lucas Di Pentima about 3 years ago
Tom Clegg wrote:
Meanwhile I found a few optimizations to speed up the manifest-to-collectionfs process by ~20% using a benchmark with 200K blocks and 4M files.
- bytes.Split instead of strings.Split
- custom token-splitting func optimized for a 1-byte separator and reusing the same destination slice
- one reused slice for the file to be added (reduces allocs and repetitive calls to Split())
- skip manifestUnescape and Split on filenames that don't contain an escape char or dir separator
- don't copy modtimes from parents while building tree (which involves lock+unlock), just wait for backdateTree to fill in everything
On my testing VM (2GB RAM & 2 cores) the speedup was around 50% (from 58 to 28 secs).
This LGTM, thanks!
Updated by Lucas Di Pentima about 3 years ago
Tom Clegg wrote:
In #12216 we introduced a UUIDTTL config (default 5s) to limit "collection state" cache time, so you could have a 5-minute cache TTL without having to wait 5 minutes between writing via arv-put/fuse and reading back via WebDAV. But we did this by expiring cached uuid-to-pdh lookup results quickly, which inadvertently made the entire cache TTL effectively 5s (if the uuid-to-pdh lookup result isn't cached, we fetch the entire collection record).
[...]
Testing on ce8i5, a sequence of GET requests on a single large collection now takes 7s, 3s, 3s, 3s, 3s, ... instead of 7s, 7s, 7s, 7s, 7s, ... but note this also requires #18122, otherwise the "get current pdh for collection" shortcut retrieves the entire manifest from RailsAPI so it's not really a shortcut.
This LGTM, thanks!
Updated by Lucas Di Pentima about 3 years ago
What do you think about using bytes.Split() on arvados.SignLocator()? Would it make a noticeable difference now that the signing is done on the controller?
Updated by Tom Clegg about 3 years ago
Lucas Di Pentima wrote:
What do you think about using bytes.Split() on arvados.SignLocator()? Would it make a noticeable difference now that the signing is done on the controller?
Ah yeah. We can surely do more here, but removing strings.Split()[0] here should avoid a lot of wasteful string/slice allocs.
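The strings.Split()[0] pattern allocates a slice holding every token just to use the first one. A sketch of the cheaper alternative, assuming the goal is to take the hash part of a Keep locator before the first "+" (stripHints is a hypothetical helper name, not the actual change):

```go
package main

import (
	"fmt"
	"strings"
)

// stripHints returns the part of a Keep locator before the first "+"
// (the content hash) without the intermediate []string that
// strings.Split(locator, "+")[0] would allocate.
func stripHints(locator string) string {
	if i := strings.IndexByte(locator, '+'); i >= 0 {
		return locator[:i]
	}
	return locator
}

func main() {
	fmt.Println(stripHints("d41d8cd98f00b204e9800998ecf8427e+0+Asig@123"))
}
```

Because the result is a substring of the input, this does no allocation at all on the hot path.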
18051-blob-signing @ 01e2aa185d373357bad711d916ff10103c48a89a -- developer-run-tests: #2697
Updated by Lucas Di Pentima about 3 years ago
Update at 70d89a76d
Out of curiosity, I've added a benchmarking test for SignManifest(). The timings I got on my VM were as follows:
- Before: 8.2, 8.0, 9.1, 8.2, 8.1 (8.3 avg)
- After: 8.1, 7.9, 7.8, 7.9, 8.0 (7.9 avg)
I'm surprised that the improvement was just ~6%; maybe I wrote the benchmarking test wrong?
If the benchmark is correct, 6% is not zero, so this LGTM.
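One common way such a benchmark goes wrong is timing setup work inside the measured loop. A minimal sketch of the shape to aim for, using testing.Benchmark so it runs standalone; signManifest here is a placeholder stand-in, not the real Arvados function:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// signManifest is placeholder work standing in for the real signing
// function being benchmarked.
func signManifest(m string) string {
	return strings.ToUpper(m)
}

func main() {
	// Build the input once, outside the timed loop, so only the
	// function under test is measured.
	manifest := strings.Repeat(". d41d8cd98f00b204e9800998ecf8427e+0 0:0:file\n", 1000)
	r := testing.Benchmark(func(b *testing.B) {
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			signManifest(manifest)
		}
	})
	fmt.Println(r.N > 0)
}
```

If setup (or, worse, I/O) ends up inside the b.N loop, a real 50% improvement in the signing code can easily show up as only a few percent in the measured total, which would also explain a ~6% result.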
Updated by Tom Clegg about 3 years ago
Lucas Di Pentima wrote:
I'm surprised that the improvement was just ~6%; maybe I wrote the benchmarking test wrong?
The "bigmanifest" has 4M files but only 200K block locators. This change will probably make a bigger difference in manifests with fewer+larger files.
Updated by Tom Clegg about 3 years ago
- Status changed from In Progress to Resolved