Project

General

Profile

Actions

Feature #16745

closed

[keep-web] Improve performance of S3 APIs using server-side cache

Added by Tom Clegg over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Story points:
-
Release relationship:
Auto

Description

Currently, when handling a series of S3 API calls, keep-web builds a new filesystem object for each call. In some common workflows (reading several small files from a large collection / deep in a project tree), this will be very inefficient.

Initial implementation may be a map[apitoken]sitefs, using a single sitefs to serve many read-only requests until reaching a configured TTL or performing a write operation with the same token. This should be relatively simple, and although write operations would still be inefficient, sequences of read operations would be much faster.


Subtasks 1 (0 open1 closed)

Task #17379: Review 16745-keep-web-cacheResolvedTom Clegg03/02/2021Actions

Related issues 1 (0 open1 closed)

Related to Arvados Epics - Idea #16516: Run Keepstore on local compute nodesResolved10/01/202111/30/2021Actions
Actions #1

Updated by Tom Clegg over 4 years ago

  • Related to Idea #16360: Keep-web supports S3 compatible interface added
Actions #2

Updated by Peter Amstutz about 4 years ago

  • Related to Idea #16516: Run Keepstore on local compute nodes added
Actions #3

Updated by Peter Amstutz about 4 years ago

  • Related to deleted (Idea #16360: Keep-web supports S3 compatible interface)
Actions #4

Updated by Peter Amstutz almost 4 years ago

Would the caching scheme apply to all of S3, WebDAV, (future) FUSE since they mostly use the same Arvados Go SDK filesystem layer?

Actions #5

Updated by Tom Clegg almost 4 years ago

Peter Amstutz wrote:

Would the caching scheme apply to all of S3, WebDAV, (future) FUSE since they mostly use the same Arvados Go SDK filesystem layer?

There's a good chance we can use the same caching strategy for WebDAV with little extra work, yes.

FUSE doesn't have this problem: it already mounts one filesystem and uses it for the life of the fuse mount.

Actions #6

Updated by Tom Clegg almost 4 years ago

  • Target version changed from Arvados Future Sprints to 2021-02-17 sprint
Actions #7

Updated by Peter Amstutz almost 4 years ago

  • Assigned To set to Tom Clegg
Actions #8

Updated by Tom Clegg almost 4 years ago

  • Status changed from New to In Progress
Actions #9

Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2021-02-17 sprint to 2021-03-03 sprint
Actions #10

Updated by Tom Clegg almost 4 years ago

  • Target version changed from 2021-03-03 sprint to Arvados Future Sprints
  • Assigned To deleted (Tom Clegg)
  • Status changed from In Progress to New
Initial manual testing plan:
  • Make/find collection on ce8i5 that performs poorly when doing a series of s3 "get" requests on small files (should be easy to reproduce with a large-manifest collection inside a couple of levels of project -- s3://project_uuid/subproject_name/collection_name/file1.txt)
  • Deploy new version and set collection cache size greater than 2x the test manifest size
  • Confirm performance is better (there should be almost zero overhead per request)
  • Load files from several large-manifest collections and confirm keep-web memory use doesn't grow past the configured cache size
  • Configure a very small cache size like 1 MiB and confirm it can still serve requests (doesn't populate and then evict before using, for example)
Actions #11

Updated by Tom Clegg almost 4 years ago

  • Target version changed from Arvados Future Sprints to 2021-03-03 sprint
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #12

Updated by Tom Clegg almost 4 years ago

16745-keep-web-cache @ 8c2bbecb1c09fbc3818dc1a2d73b3fda2ba68e02 -- developer-run-tests: #2338

Binary /home/tom/keep-web.8c2bbecb1 on keep.ce8i5.arvadosapi.com is available to test, in case anyone gets to it before I do.

The existing metric arvados_keepweb_collectioncache_cached_manifest_bytes now also includes the (approximate) memory used by the new session cache. This might not be the best way to express it (and ultimately the session cache should replace the collection cache entirely) but for the time being should at least make it easier to confirm the cache is behaving as expected.

Actions #13

Updated by Tom Clegg almost 4 years ago

quick manual testing on ce8i5

seq 2.2.0~dev20210228221303-1 16745-keep-web-cache 8c2bbecb1
1 s3://ce8i5-j7d0g-93r09l3eydh39fp/ 27s 49s
2 s3://ce8i5-j7d0g-93r09l3eydh39fp/ 34s 1s
3 s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/ 43s 1s
4 s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/ 42s 1s
5 s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/Project 1 - 1/ 46s 1s
6 s3://ce8i5-j7d0g-93r09l3eydh39fp/Project 1/Project 1 - 1/ 48s 1s
7 s3://ce8i5-j7d0g-93r09l3eydh39fp/ 52s
Actions #14

Updated by Tom Clegg almost 4 years ago

The surprisingly slow responses appear to be caused by s3cmd doing a "get bucket location" request ("GET /bucket/?location"), which keep-web misinterprets as "list bucket contents with no delimiter", which involves recursively listing the entire project hierarchy.

It's possible the original bug report arose from a client that uses the same library/strategy, and more of the slowness experienced by the user was attributable to "get bucket location" calls than to the get/list calls.

Actions #15

Updated by Tom Clegg almost 4 years ago

With GetBucketLocation returning the cluster ID as the location, same sequence of tests:

seq
1 1.4s
2 0.6s
3 0.7s
4 0.6s
5 0.7s
6 0.5s
7 0.6s
Actions #16

Updated by Tom Clegg almost 4 years ago

16745-keep-web-cache @ 2fe25e2b32042098106acead136fd3064bab30e3 -- developer-run-tests: #2356

  • new session cache
  • handle GetBucketLocation API (location returned is the cluster ID)
  • return 400 for unsupported APIs like GetObjectAcl instead of mishandling them as GetObject
Actions #17

Updated by Ward Vandewege almost 4 years ago

Tom Clegg wrote:

16745-keep-web-cache @ 2fe25e2b32042098106acead136fd3064bab30e3 -- developer-run-tests: #2356

Looks like a transient wb intergration test failure, otherwise ok.

  • new session cache
  • handle GetBucketLocation API (location returned is the cluster ID)
  • return 400 for unsupported APIs like GetObjectAcl instead of mishandling them as GetObject

Review comment, just one thing:

  • the `cached_manifest_bytes` metric is awkwardly named now that it also includes session size. Maybe we should rename it to `cached_collection_bytes` (to line up with the `MaxCollectionBytes` configuration parameter)? We only just added this metric in 2.1.0, right? It's probably fine to rename it to be more accurately named.

Thanks, LGTM otherwise.

Actions #18

Updated by Tom Clegg almost 4 years ago

Renamed the metric to cached_collection_bytes. Also changed the namespace from keepweb_collectioncache to keepweb_sessions. We're planning to get rid of the "collection cache" stuff and move everything to session cache, so I figure we might as well avoid changing the name again later or leaving it with a mysteriously different name.

16745-keep-web-cache @ 30bfb516f3f5bc88c3d0c07e380496f31b65945f -- developer-run-tests: #2358

Actions #19

Updated by Ward Vandewege almost 4 years ago

Tom Clegg wrote:

Renamed the metric to cached_collection_bytes. Also changed the namespace from keepweb_collectioncache to keepweb_sessions. We're planning to get rid of the "collection cache" stuff and move everything to session cache, so I figure we might as well avoid changing the name again later or leaving it with a mysteriously different name.

16745-keep-web-cache @ 30bfb516f3f5bc88c3d0c07e380496f31b65945f -- developer-run-tests: #2358

Thanks, LGTM.

Actions #20

Updated by Tom Clegg almost 4 years ago

  • Target version changed from 2021-03-03 sprint to 2021-03-17 sprint
Actions #21

Updated by Tom Clegg almost 4 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #22

Updated by Ward Vandewege almost 4 years ago

  • Release set to 39
Actions #23

Updated by Ward Vandewege over 3 years ago

  • Release changed from 39 to 38
Actions

Also available in: Atom PDF