Bug #18051

Investigate backend performance for WebDAV requests on very large collections

Added by Peter Amstutz 26 days ago. Updated 4 days ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 09/16/2021
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Story points: -

Description

In working on #17585 there are some questions as to whether the backend is slowing down the frontend.


Subtasks

Task #18082: Review 18051-collectionfs (In Progress, Tom Clegg)

Task #18160: Review 18051-webdav-cache (In Progress, Tom Clegg)

History

#1 Updated by Peter Amstutz 26 days ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 26 days ago

  • Description updated (diff)

#3 Updated by Peter Amstutz 19 days ago

  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg 9 days ago

It looks like RailsAPI is doing some CPU-intensive work after our code sends a collections#get response. Using wrong JSON encoder? Computing an Etag/hash? TBD.

It looks like keep-web is not effectively caching a large collection even when it's needed for consecutive requests. Default cache size too small? Depends on code path ("home" vs read-write)? TBD.

Meanwhile I found a few optimizations to speed up the manifest-to-collectionfs process by ~20% using a benchmark with 200K blocks and 4M files.
  • bytes.Split instead of strings.Split
  • custom token-splitting func optimized for a 1-byte separator and reusing the same destination slice
  • one reused slice for the file to be added (reduces allocs and repetitive calls to Split())
  • skip manifestUnescape and Split on filenames that don't contain an escape char or dir separator
  • don't copy modtimes from parents while building tree (which involves lock+unlock), just wait for backdateTree to fill in everything

benchmark time before optimizations: 10.7, 11.0, 10.8

benchmark time after optimizations: 8.7, 8.6, 8.4

18051-collectionfs @ 34239884a3eabd19ce02445c5582ba1102bbf4e8 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2685/

#5 Updated by Tom Clegg 5 days ago

  • Target version changed from 2021-09-15 sprint to 2021-09-29 sprint

#6 Updated by Tom Clegg 4 days ago

In #12216 we introduced a UUIDTTL config (default 5s) to limit "collection state" cache time, so you could have a 5-minute cache TTL without having to wait 5 minutes between writing via arv-put/fuse and reading back via WebDAV. But we did this by expiring cached uuid-to-pdh lookup results quickly, which inadvertently made the entire cache TTL effectively 5s (if the uuid-to-pdh lookup result isn't cached, we fetch the entire collection record).

What we really needed was a "revalidate" time on the uuid-to-pdh cache, which also happens to make the permission cache redundant.

Changes here:
  • use the UUIDTTL config as a "revalidate" time (we also might want to consider renaming it).
  • remove permission cache
  • serialize cache sweeps (previous code was prone to starting multiple concurrent sweeps, which is pointless)

Testing on ce8i5, a sequence of GET requests on a single large collection now takes 7s, 3s, 3s, 3s, 3s, ... instead of 7s, 7s, 7s, 7s, 7s, ... but note this also requires #18122, otherwise the "get current pdh for collection" shortcut retrieves the entire manifest from RailsAPI so it's not really a shortcut.

time curl --user none:$ARVADOS_API_TOKEN https://ce8i5-4zz18-3asw0byh6ta6y6p.collections.ce8i5.arvadosapi.com/foobar

18051-webdav-cache @ 02b35f7480e2792377e2ed23f740fff4b53badb9 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2690/

#7 Updated by Lucas Di Pentima 4 days ago

Tom Clegg wrote:

Meanwhile I found a few optimizations to speed up the manifest-to-collectionfs process by ~20% using a benchmark with 200K blocks and 4M files.
  • bytes.Split instead of strings.Split
  • custom token-splitting func optimized for a 1-byte separator and reusing the same destination slice
  • one reused slice for the file to be added (reduces allocs and repetitive calls to Split())
  • skip manifestUnescape and Split on filenames that don't contain an escape char or dir separator
  • don't copy modtimes from parents while building tree (which involves lock+unlock), just wait for backdateTree to fill in everything

On my testing VM (2GB RAM & 2 cores) the speed up was around 50% (from 58 to 28 secs).

This LGTM, thanks!

#8 Updated by Lucas Di Pentima 4 days ago

Tom Clegg wrote:

In #12216 we introduced a UUIDTTL config (default 5s) to limit "collection state" cache time, so you could have a 5-minute cache TTL without having to wait 5 minutes between writing via arv-put/fuse and reading back via WebDAV. But we did this by expiring cached uuid-to-pdh lookup results quickly, which inadvertently made the entire cache TTL effectively 5s (if the uuid-to-pdh lookup result isn't cached, we fetch the entire collection record).

[...]

Testing on ce8i5, a sequence of GET requests on a single large collection now takes 7s, 3s, 3s, 3s, 3s, ... instead of 7s, 7s, 7s, 7s, 7s, ... but note this also requires #18122, otherwise the "get current pdh for collection" shortcut retrieves the entire manifest from RailsAPI so it's not really a shortcut.

This LGTM, thanks!

#9 Updated by Lucas Di Pentima 4 days ago

What do you think about using bytes.Split() on arvados.SignLocator()? Would it make a noticeable difference now that the signing is done on the controller?
