Feature #16745

[keep-web] Improve performance of S3 APIs using server-side cache

Added by Tom Clegg 8 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
03/02/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

Currently, when handling a series of S3 API calls, keep-web builds a new filesystem object for each call. In some common workflows (reading several small files from a large collection / deep in a project tree), this will be very inefficient.

Initial implementation may be a map[apitoken]sitefs, using a single sitefs to serve many read-only requests until reaching a configured TTL or performing a write operation with the same token. This should be relatively simple, and although write operations would still be inefficient, sequences of read operations would be much faster.


Subtasks

Task #17379: Review 16745-keep-web-cache (Resolved, Tom Clegg)


Related issues

Related to Arvados Epics - Story #16516: Run Keepstore on local compute nodes (New, 07/01/2021 - 10/31/2021)

Associated revisions

Revision a76b52ff
Added by Tom Clegg about 2 months ago

Merge branch '16745-keep-web-cache'

fixes #16745

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 8 months ago

  • Related to Story #16360: Keep-web supports S3 compatible interface added

#2 Updated by Peter Amstutz 4 months ago

  • Related to Story #16516: Run Keepstore on local compute nodes added

#3 Updated by Peter Amstutz 4 months ago

  • Related to deleted (Story #16360: Keep-web supports S3 compatible interface)

#4 Updated by Peter Amstutz 2 months ago

Would the caching scheme apply to all of S3, WebDAV, (future) FUSE since they mostly use the same Arvados Go SDK filesystem layer?

#5 Updated by Tom Clegg 2 months ago

Peter Amstutz wrote:

Would the caching scheme apply to all of S3, WebDAV, (future) FUSE since they mostly use the same Arvados Go SDK filesystem layer?

There's a good chance we can use the same caching strategy for WebDAV with little extra work, yes.

FUSE doesn't have this problem: it already mounts one filesystem and uses it for the life of the fuse mount.

#6 Updated by Tom Clegg 2 months ago

  • Target version changed from Arvados Future Sprints to 2021-02-17 sprint

#7 Updated by Peter Amstutz 2 months ago

  • Assigned To set to Tom Clegg

#8 Updated by Tom Clegg about 2 months ago

  • Status changed from New to In Progress

#9 Updated by Peter Amstutz about 2 months ago

  • Target version changed from 2021-02-17 sprint to 2021-03-03 sprint

#10 Updated by Tom Clegg about 2 months ago

  • Target version changed from 2021-03-03 sprint to Arvados Future Sprints
  • Assigned To deleted (Tom Clegg)
  • Status changed from In Progress to New
Initial manual testing plan:
  • Make/find collection on ce8i5 that performs poorly when doing a series of s3 "get" requests on small files (should be easy to reproduce with a large-manifest collection inside a couple of levels of project -- s3://project_uuid/subproject_name/collection_name/file1.txt)
  • Deploy new version and set collection cache size greater than 2x the test manifest size
  • Confirm performance is better (there should be almost zero overhead per request)
  • Load files from several large-manifest collections and confirm keep-web memory use doesn't grow past the configured cache size
  • Configure a very small cache size like 1 MiB and confirm it can still serve requests (doesn't populate and then evict before using, for example)

#11 Updated by Tom Clegg about 2 months ago

  • Target version changed from Arvados Future Sprints to 2021-03-03 sprint
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress

#12 Updated by Tom Clegg about 2 months ago

16745-keep-web-cache @ 8c2bbecb1c09fbc3818dc1a2d73b3fda2ba68e02 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2338/

Binary /home/tom/keep-web.8c2bbecb1 on keep.ce8i5.arvadosapi.com is available to test, in case anyone gets to it before I do.

The existing metric arvados_keepweb_collectioncache_cached_manifest_bytes now also includes the (approximate) memory used by the new session cache. This might not be the best way to express it (and ultimately the session cache should replace the collection cache entirely) but for the time being should at least make it easier to confirm the cache is behaving as expected.
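To illustrate what "approximate memory used" means here: the dominant cost of an in-memory sitefs is the manifest text, so the combined gauge value can be approximated by summing manifest sizes across both caches. This sketch is illustrative only; the struct and field names are assumptions, not keep-web's actual types.

```go
package main

import "fmt"

// cachedSession stands in for one entry in the new session cache.
type cachedSession struct {
	manifestText string // dominant memory cost of an in-memory sitefs
}

// cache holds both the old collection cache and the new session cache.
type cache struct {
	collections map[string]string         // collection UUID -> manifest text
	sessions    map[string]*cachedSession // API token -> session
}

// memorySize approximates the bytes held by both caches; this combined
// value is what the cached_manifest_bytes gauge would report.
func (c *cache) memorySize() int64 {
	var n int64
	for _, m := range c.collections {
		n += int64(len(m))
	}
	for _, s := range c.sessions {
		n += int64(len(s.manifestText))
	}
	return n
}

func main() {
	c := &cache{
		collections: map[string]string{"zzzzz-4zz18-aaaaaaaaaaaaaaa": "manifest A"},
		sessions:    map[string]*cachedSession{"token": {manifestText: "manifest B"}},
	}
	fmt.Println(c.memorySize()) // prints 20
}
```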

#13 Updated by Tom Clegg about 2 months ago

quick manual testing on ce8i5

seq  bucket/prefix                                              2.2.0~dev20210228221303-1  16745-keep-web-cache 8c2bbecb1
1    s3://ce8i5-j7d0g-93r09l3eydh39fp/                          27s                        49s
2    s3://ce8i5-j7d0g-93r09l3eydh39fp/                          34s                        1s
3    s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/                43s                        1s
4    s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/                42s                        1s
5    s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/Project 1 - 1/  46s                        1s
6    s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/Project 1 - 1/  48s                        1s
7    s3://ce8i5-j7d0g-93r09l3eydh39fp/                          52s

#14 Updated by Tom Clegg about 2 months ago

The surprisingly slow responses appear to be caused by s3cmd doing a "get bucket location" request ("GET /bucket/?location"), which keep-web misinterprets as "list bucket contents with no delimiter", which involves recursively listing the entire project hierarchy.

It's possible the original bug report arose from a client that uses the same library/strategy, and more of the slowness experienced by the user was attributable to "get bucket location" calls than to the get/list calls.

#15 Updated by Tom Clegg about 2 months ago

With GetBucketLocation returning the cluster ID as the location, same sequence of tests:

seq  time
1    1.4s
2    0.6s
3    0.7s
4    0.6s
5    0.7s
6    0.5s
7    0.6s

#16 Updated by Tom Clegg about 2 months ago

16745-keep-web-cache @ 2fe25e2b32042098106acead136fd3064bab30e3 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2356/

  • new session cache
  • handle GetBucketLocation API (location returned is the cluster ID)
  • return 400 for unsupported APIs like GetObjectAcl instead of mishandling them as GetObject

#17 Updated by Ward Vandewege about 2 months ago

Tom Clegg wrote:

16745-keep-web-cache @ 2fe25e2b32042098106acead136fd3064bab30e3 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2356/

Looks like a transient Workbench integration test failure, otherwise OK.

  • new session cache
  • handle GetBucketLocation API (location returned is the cluster ID)
  • return 400 for unsupported APIs like GetObjectAcl instead of mishandling them as GetObject

Review comment, just one thing:

  • the `cached_manifest_bytes` metric is awkwardly named now that it also includes session size. Maybe we should rename it to `cached_collection_bytes` (to line up with the `MaxCollectionBytes` configuration parameter)? We only just added this metric in 2.1.0, right? It should be fine to rename it to something more accurate.

Thanks, LGTM otherwise.

#18 Updated by Tom Clegg about 2 months ago

Renamed the metric to cached_collection_bytes. Also changed the namespace from keepweb_collectioncache to keepweb_sessions. We're planning to get rid of the "collection cache" stuff and move everything to session cache, so I figure we might as well avoid changing the name again later or leaving it with a mysteriously different name.

16745-keep-web-cache @ 30bfb516f3f5bc88c3d0c07e380496f31b65945f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2358/

#19 Updated by Ward Vandewege about 2 months ago

Tom Clegg wrote:

Renamed the metric to cached_collection_bytes. Also changed the namespace from keepweb_collectioncache to keepweb_sessions. We're planning to get rid of the "collection cache" stuff and move everything to session cache, so I figure we might as well avoid changing the name again later or leaving it with a mysteriously different name.

16745-keep-web-cache @ 30bfb516f3f5bc88c3d0c07e380496f31b65945f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2358/

Thanks, LGTM.

#20 Updated by Tom Clegg about 2 months ago

  • Target version changed from 2021-03-03 sprint to 2021-03-17 sprint

#21 Updated by Tom Clegg about 2 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

#22 Updated by Ward Vandewege about 1 month ago

  • Release set to 39
