Feature #20318 (closed)

Go SDK supports local filesystem-backed data cache

Added by Peter Amstutz 12 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Story points: -

Description

Add option to store cached blocks in local filesystem rather than RAM in order to take advantage of better performance of (typically SSD-backed) local disk relative to backend, and let the kernel decide what to cache in RAM.

Where reasonable, implement the same behavior as the Python SDK in laying out and managing the cache directory containing the files for each cached block.

Caller should be able to read a partial block before the entire block has been received from upstream, allowing block content to be streamed. This can be based on AsyncBuf, but with a file or mmap as the backing store instead of a plain slice.
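A minimal sketch of that streaming-read idea (illustrative only, not the SDK implementation; it assumes a single writer goroutine, and the type and field names are made up):

    // Sketch: a goroutine copies a block from the backend into a cache
    // file, and readers block until the byte range they want has been
    // written. Range/EOF edge cases are omitted for brevity.
    package cachefile

    import (
        "os"
        "sync"
    )

    type partialFile struct {
        f       *os.File   // backing store (a file instead of a plain slice)
        mtx     sync.Mutex // protects written and err
        cond    *sync.Cond // signaled whenever written or err changes
        written int64      // bytes copied from the backend so far
        size    int64      // expected final block size
        err     error      // error from the copy goroutine, if any
    }

    func newPartialFile(f *os.File, size int64) *partialFile {
        pf := &partialFile{f: f, size: size}
        pf.cond = sync.NewCond(&pf.mtx)
        return pf
    }

    // Write appends p to the backing file (single writer assumed) and
    // wakes up any readers waiting for more data.
    func (pf *partialFile) Write(p []byte) (int, error) {
        n, err := pf.f.WriteAt(p, pf.written)
        pf.mtx.Lock()
        pf.written += int64(n)
        if err != nil {
            pf.err = err
        }
        pf.cond.Broadcast()
        pf.mtx.Unlock()
        return n, err
    }

    // ReadAt waits until the requested range has been written (or the
    // writer has failed), then reads it back from the file.
    func (pf *partialFile) ReadAt(dst []byte, off int64) (int, error) {
        need := off + int64(len(dst))
        if need > pf.size {
            need = pf.size
        }
        pf.mtx.Lock()
        for pf.written < need && pf.err == nil {
            pf.cond.Wait()
        }
        err := pf.err
        pf.mtx.Unlock()
        if err != nil {
            return 0, err
        }
        return pf.f.ReadAt(dst[:need-off], off)
    }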


Subtasks (1 total: 0 open, 1 closed)

Task #21177: Review 20318-disk-cache (Resolved, Tom Clegg, 12/29/2023)

Related issues

Related to Arvados Epics - Idea #18342: Keep performance optimization (New, 08/01/2023 - 03/31/2024)
Related to Arvados - Feature #21196: implement keepclient API on top of block cache with pluggable backend (Resolved, Tom Clegg)
Related to Arvados - Idea #21323: System services use cache/config directories indicated by XDG env vars / systemd directives (New)
Related to Arvados - Bug #21428: keepclient test StandaloneSuite.TestPutHR fails intermittently (Resolved, Tom Clegg)
Actions #1

Updated by Peter Amstutz 9 months ago

  • Related to Idea #17001: Arvados uses WB2 by default added
Actions #2

Updated by Peter Amstutz 9 months ago

  • Related to deleted (Idea #17001: Arvados uses WB2 by default)
Actions #3

Updated by Peter Amstutz 9 months ago

  • Related to Idea #18342: Keep performance optimization added
Actions #4

Updated by Peter Amstutz 9 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 6 months ago

  • Target version changed from Future to Development 2023-11-08 sprint
Actions #6

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Actions #7

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Actions #8

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-01-03 sprint to Development 2023-11-29 sprint
Actions #9

Updated by Tom Clegg 6 months ago

  • Description updated (diff)
  • Subject changed from Go SDK keep cache supports memory mapped block cache to Go SDK supports local filesystem-backed data cache
Actions #10

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2023-11-29 sprint to Development 2023-10-25 sprint
Actions #11

Updated by Peter Amstutz 5 months ago

  • Assigned To set to Tom Clegg
Actions #12

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Actions #13

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Actions #14

Updated by Peter Amstutz 4 months ago

  • Description updated (diff)
Actions #15

Updated by Peter Amstutz 4 months ago

  • Related to Feature #21196: implement keepclient API on top of block cache with pluggable backend added
Actions #16

Updated by Peter Amstutz 4 months ago

  • Status changed from New to In Progress
Actions #17

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Actions #18

Updated by Tom Clegg 3 months ago

20318-disk-cache @ 461fdaa1b96142b8065c131ae0334046fc71ea56 -- developer-run-tests: #3966

  • ✅ All agreed upon points are implemented / addressed.
    • Adds filesystem-backed cache
    • Cache dir is /var/cache/arvados/keep when running as root (should we update Python SDK to do this too?)
    • Cache dir is ~/.cache/arvados/keep when not running as root (same as Python SDK)
    • Default cache size is 10% of disk; if disk size cannot be determined, default size is min(1 GB, current_cache_usage)
    • Implements streaming writes (no store & forward delay / extra buffer memory)
    • Implements streaming reads (entire blocks are retrieved from the backend in the background; incoming partial read requests are returned as soon as the relevant part has been retrieved)
    • Reduces open/close overhead by keeping up to 10K filehandles open (or 1/4 of NOFILES if smaller -- but NOFILES is typically 1M because the Go runtime automatically raises it to hardlimit)
    • Gracefully handles cache files disappearing (e.g., if a Python program is using the same cache dir)
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • ☐ todo: config entry for disk cache size on server hosts (currently only keep-web will use it)
    • ☐ todo: arvados-client mount command line flag for disk cache size
  • ✅ Code is tested and passing, both automated and manual, what manual testing was done is described
    • Did some manual testing with arvados-client mount + tordo to confirm that streaming really works. It worked.
  • Documentation has been updated.
    • ☐ todo: mention that /var/cache/arvados/keep must be writable/creatable on server nodes
  • Behaves appropriately at the intended scale (describe intended scale).
    • See below
  • ✅ Considered backwards and forwards compatibility issues between client and server.
    • n/a
  • ✅ Follows our coding standards and GUI style guidelines.
The cache gets tidied after writing a cache file when the estimated disk usage exceeds the specified maximum, or the estimate is 5 minutes out of date.
  • Tidying means: walk cache dir, and if actual usage exceeds the specified maximum, delete old files until usage is 5% below the specified maximum.
  • The 5% rule is meant to avoid excessive tidying operations (write a block, walk the whole cache dir, delete one old block, write another block, walk the whole cache dir again, ...).
  • The 5 minute rule accounts for the fact that the estimated disk usage can drift from actual usage for various reasons.

If the cache size is 200 GiB and the average block size is 4 MiB, the "tidy" goroutine will traverse a cache dir structure with 51K files in 4K subdirectories. This seems manageable, but it would be wasteful if there are significant periods with very little activity. We could tweak the tidying-on-timer logic -- e.g., rather than use clock time, use the number of cache files: if there are N cache files, tidy the cache once every N/100 cache file writes even if our estimated size isn't above max.

Actions #19

Updated by Brett Smith 3 months ago

Tom Clegg wrote in #note-18:

  • Cache dir is /var/cache/arvados/keep when running as root (should we update Python SDK to do this too?)
  • Cache dir is ~/.cache/arvados/keep when not running as root (same as Python SDK)

I think this is fine, but to your question: systemd lets service definitions declare what directories they want to use for cache, configuration, etc. It works pretty similarly to this, except the distinction is between system service/user service instead of root/non-root.

I think the ideal place to get all our software to would be:

  • If the systemd $*_DIRECTORY variable is set, use that.
  • Otherwise, if the XDG $XDG_*_HOME/$XDG_*_DIR variable is set, use that. (See #21020)
  • Otherwise, use the defaults from your logic above.
  • Update our systemd unit files to use the *Directory directives.

If there's agreement that's a good direction to go, this could be broken out into any number of follow-up stories.
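For concreteness, a rough sketch of that lookup order (not existing SDK code; CACHE_DIRECTORY is the variable systemd sets when a unit uses CacheDirectory=, XDG_CACHE_HOME is the XDG base-directory variable, and the fallback mirrors the current root/non-root defaults):

    // Sketch only.
    package cachedir

    import (
        "os"
        "path/filepath"
    )

    func keepCacheDir() string {
        if d := os.Getenv("CACHE_DIRECTORY"); d != "" {
            // set by systemd for units that declare CacheDirectory=
            return filepath.Join(d, "keep")
        }
        if d := os.Getenv("XDG_CACHE_HOME"); d != "" {
            return filepath.Join(d, "arvados", "keep")
        }
        if os.Getuid() == 0 {
            return "/var/cache/arvados/keep"
        }
        home, _ := os.UserHomeDir()
        return filepath.Join(home, ".cache", "arvados", "keep")
    }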

Actions #20

Updated by Tom Clegg 3 months ago

(updated version of #note-18)

20318-disk-cache @ ec8d4899d74ea76c30cb394dfcda6a54cd9e6652 -- developer-run-tests: #3975

  • ✅ All agreed upon points are implemented / addressed.
    • Adds filesystem-backed cache
    • Cache dir is /var/cache/arvados/keep when running as root (should we update Python SDK to do this too?)
    • Cache dir is ~/.cache/arvados/keep when not running as root (same as Python SDK)
    • Default cache size is 10% of disk; if disk size cannot be determined, default size is min(1 GB, current_cache_usage)
    • Implements streaming writes (no store & forward delay / extra buffer memory)
    • Implements streaming reads (entire blocks are retrieved from the backend in the background; incoming partial read requests are returned as soon as the relevant part has been retrieved)
    • Reduces open/close overhead by keeping up to 10K filehandles open (or 1/40 of NOFILES if smaller -- but NOFILES is typically 1M because the Go runtime automatically raises it to hardlimit)
    • Internal cache state is shared across all keepclient objects, tokens, etc., so a program that accesses data on multiple clusters will use one pool of 10K filehandles
    • Gracefully handles cache files disappearing (e.g., if a Python program is using the same cache dir)
    • DiskCacheSize config entry for disk cache size on server hosts (only keep-web uses it, so it's in the existing WebDAVCache section), accepts values like 5% or 5GiB
    • arvados-client mount -cache-size=5% command line flag
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
  • ✅ Code is tested and passing, both automated and manual, what manual testing was done is described
    • Did some manual testing with arvados-client mount + tordo to confirm that streaming really works. It worked.
  • Documentation has been updated.
    • ✅ upgrade notes
    • ✅ keep-web install page
  • Behaves appropriately at the intended scale (describe intended scale).
    • See below
  • ✅ Considered backwards and forwards compatibility issues between client and server.
    • n/a
  • ✅ Follows our coding standards and GUI style guidelines.
The cache gets tidied after writing a cache file when the estimated disk usage exceeds the specified maximum, or there have been N/100 files written since last tidy (where N is the number of cache files left after last tidy).
  • Tidying means: walk cache dir, and if actual usage exceeds the specified maximum, delete old files until usage is 5% below the specified maximum.
  • The 5% rule is meant to avoid excessive tidying operations (write a block, walk the whole cache dir, delete one old block, write another block, walk the whole cache dir again, ...).
  • The N/100 rule accounts for the fact that the estimated disk usage can drift from actual usage for various reasons, like multiple processes sharing the same cache dir.
  • If the cache size is 200 GiB and the average block size is 4 MiB, the "tidy" goroutine will traverse a cache dir structure with 51K files in 4K subdirectories. The N/100 rule would cause this to happen after every 500 file writes (500 * 4 MiB = 2 GiB). Is it safe to say traversing a tree of 51K files is a small enough amount of work relative to 2 GiB of write activity?
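A rough Go sketch of the tidy trigger described above (illustrative only; the field names are made up and the real implementation differs in detail):

    // Sketch of the tidy trigger, not the actual code.
    package diskcache

    import "sync"

    type tidyTrigger struct {
        mtx                sync.Mutex
        estimatedUsage     int64 // bytes we believe are in the cache dir
        maxSize            int64 // configured maximum cache size, in bytes
        writesSinceTidy    int   // cache files written since the last tidy
        filesAfterLastTidy int   // N: cache files remaining after the last tidy
    }

    // afterWrite is called after each cache-file write and reports
    // whether a tidy pass should run now (walk the cache dir; if actual
    // usage exceeds maxSize, delete old files until 5% below maxSize).
    func (t *tidyTrigger) afterWrite(size int64) bool {
        t.mtx.Lock()
        defer t.mtx.Unlock()
        t.estimatedUsage += size
        t.writesSinceTidy++
        // Trigger if the estimate exceeds the max, or N/100 files have
        // been written since the last tidy. (With an empty cache this
        // degenerates to tidying after every write, which is cheap.)
        return t.estimatedUsage > t.maxSize ||
            t.writesSinceTidy*100 >= t.filesAfterLastTidy
    }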
Actions #21

Updated by Tom Clegg 3 months ago

  • Related to Idea #21323: System services use cache/config directories indicated by XDG env vars / systemd directives added
Actions #22

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Actions #23

Updated by Peter Amstutz 3 months ago

func (cache *DiskCache) BlockWrite

+    go func() {
+        defer tmpfile.Close()
+        defer os.Remove(tmpfilename)

Looks like this will call os.Remove(tmpfilename) whether the file was renamed or not. It also seems like it will double-call tmpfile.Close(). Since the errors are ignored it doesn't fail, but it seems like it is possible (albeit with low probability) that those calls could do something unwanted. Might benefit from a comment explaining why nothing bad is going to happen, or being more defensive and setting tmpfilename / tmpfile to nil once they've been cleaned up, and checking for that in the defer.
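A sketch of the more defensive variant suggested here (not the code under review; the function and its arguments are made up so the example stands alone): clear tmpfile/tmpfilename once the rename has succeeded, so the deferred cleanup becomes a guaranteed no-op.

    // Sketch only.
    package diskcache

    import "os"

    func writeCacheFile(tmpfilename, cachefilename string, fill func(*os.File) error) error {
        tmpfile, err := os.OpenFile(tmpfilename, os.O_CREATE|os.O_WRONLY|os.O_EXCL, 0600)
        if err != nil {
            return err
        }
        defer func() {
            // Disarmed below after a successful rename, so these calls
            // can never touch the final cache file.
            if tmpfile != nil {
                tmpfile.Close()
            }
            if tmpfilename != "" {
                os.Remove(tmpfilename)
            }
        }()
        if err := fill(tmpfile); err != nil {
            return err
        }
        if err := tmpfile.Close(); err != nil {
            return err
        }
        if err := os.Rename(tmpfilename, cachefilename); err != nil {
            return err
        }
        tmpfile, tmpfilename = nil, "" // disarm the deferred cleanup
        return nil
    }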

func (cache *DiskCache) quickReadAt

+    heldopen := cache.heldopen[cachefilename]
+    if heldopen == nil {
+        isnew = true
+        heldopen = &openFileEnt{}
+        if cache.heldopen == nil {
+            cache.heldopen = make(map[string]*openFileEnt, cache.heldopenMax)

If cache.heldopen is nil, wouldn't the first line (cache.heldopen[cachefilename]) panic before we get to the last line where you assign cache.heldopen? Or do Go's semantics allow lookups on nil maps?

+            // Rather than go to the trouble of tracking
+            // last access time, just close all files, and
+            // open again as needed. Even in the worst

It's fine not to want to fuss around with a LRU stack or check atimes, but instead of closing every single file wouldn't it be more efficient to pick a few at random to evict? Although I guess the assumption here is that this happens infrequently, and it is more likely we are evicting blocks because of too much disk usage than too many file handles.

+        // Open and flock the file, then call wg.Done() to
+        // unblock any other goroutines that are waiting in
+        // the !isnew case above.

This comment seems out of date since there's no call to wg.Done().

            cache.writing = map[string]*writeprogress{}

I thought you had to call `make(map[string]*writeprogress)`, is this a new syntax? Or is the behavior different somehow?

func (cache *DiskCache) ReadAt

            defer func() {
                closeErr := writef.Close()
            }()
            writef, err = cache.openFile(cachefilename, os.O_WRONLY)

If cache.openFile fails, is writef nil or a bogus File object? If it is nil, then closing it would panic?

The interaction between KeepClient and keepViaHTTP needs to be explained better. I think what's happening is this:

  • KeepClient implements KeepGateway with a disk cache, if a block isn't available in the cache, the disk cache calls keepViaHTTP which calls back to KeepClient
  • The existing KeepClient API doesn't change. It doesn't use the cache.
  • The KeepGateway API is new, and it does use the cache.

I guess in a perfect world these would be separate, but in the world we have, it's much easier (and less disruptive) to continue passing around the KeepClient object and migrate users over to the KeepGateway API as needed?
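For illustration, the layering described above might be pictured roughly like this (the names and signatures are simplified stand-ins, not the actual SDK types):

    // Purely illustrative, not the real KeepGateway/KeepClient API.
    package keepsketch

    import "context"

    type BlockGateway interface {
        BlockRead(ctx context.Context, locator string, dst []byte) (int, error)
        BlockWrite(ctx context.Context, data []byte) (locator string, err error)
    }

    // DiskCacheSketch checks its cache dir first and, on a miss,
    // delegates to the next gateway in the chain.
    type DiskCacheSketch struct {
        Next BlockGateway // an HTTP-backed gateway that reuses the existing client logic
    }

    func (c *DiskCacheSketch) BlockRead(ctx context.Context, locator string, dst []byte) (int, error) {
        // cache lookup omitted; on a miss, fall through to the backend
        return c.Next.BlockRead(ctx, locator, dst)
    }

    func (c *DiskCacheSketch) BlockWrite(ctx context.Context, data []byte) (string, error) {
        // write through to the backend, then also store in the cache dir
        return c.Next.BlockWrite(ctx, data)
    }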

Actions #24

Updated by Tom Clegg 3 months ago

Peter Amstutz wrote in #note-23:

[...]

Looks like this will call os.Remove(tmpfilename) whether the file was renamed or not. It also seems like it will double-call tmpfile.Close(). Since the errors are ignored it doesn't fail, but it seems like it is possible (albeit with low probability) that those calls could do something unwanted. Might benefit from a comment explaining why nothing bad is going to happen, or being more defensive and setting tmpfilename / tmpfile to nil once they've been cleaned up, and checking for that in the defer.

Yes, both are no-ops in the happy path. Added comments.

If cache.heldopen is nil, wouldn't the first line (cache.heldopen[cachefilename]) panic before we get to the last line where you assign cache.heldopen ? Or does Go semantics allow you to do lookups on nil maps?

Yes, for read operations, nil is equivalent to an empty map.
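A tiny self-contained demonstration of that point:

    package main

    import "fmt"

    func main() {
        var m map[string]int       // nil map
        v, ok := m["anything"]     // legal: returns the zero value, ok == false
        fmt.Println(v, ok, len(m)) // 0 false 0
        // m["x"] = 1              // writing to a nil map, by contrast, would panic
    }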

It's fine not to want to fuss around with a LRU stack or check atimes, but instead of closing every single file wouldn't it be more efficient to pick a few at random to evict? Although I guess the assumption here is that this happens infrequently, and it is more likely we are evicting blocks because of too much disk usage than too many file handles.

Yes. The overhead of reopening files is quite low. In a pathological case where we read once from each of 10K files, then close them all, then open nearly all of the same 10K files again, I don't think our chances are very good of correctly guessing which of the nearly-10K files to keep open.

Practically speaking I think the current approach will have a low enough open/close overhead that it's not worth optimizing.

This comment seems out of date since there's no call to wg.Done().

Indeed. Fixed.

I thought you had to call `make(map[string]*writeprogress)`, is this a new syntax? Or is the behavior different somehow?

Same. It's just an empty literal map, like map[string]*writeprogress{"foo": nil} but with no entries.
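A small demonstration of the equivalence (writeprogress is stubbed here so the example stands alone):

    package main

    import "fmt"

    type writeprogress struct{} // stand-in for the type in keep_cache.go

    func main() {
        m1 := map[string]*writeprogress{}     // empty composite literal
        m2 := make(map[string]*writeprogress) // equivalent: empty, non-nil map
        fmt.Println(len(m1), len(m2), m1 == nil, m2 == nil) // 0 0 false false
    }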

If cache.openFile fails, is writef nil or a bogus File object? If it is nil, then closing it would panic?

Oops, yes, that would panic. Fixed.
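One way to avoid that panic (a sketch, not necessarily how the branch fixes it) is to register the deferred Close only after the open has succeeded; os.OpenFile stands in for cache.openFile here:

    // Sketch only.
    package diskcache

    import "os"

    func writeToCacheFile(cachefilename string, p []byte, off int64) error {
        writef, err := os.OpenFile(cachefilename, os.O_WRONLY, 0)
        if err != nil {
            return err // nothing to close: the defer below was never reached
        }
        defer writef.Close()
        _, err = writef.WriteAt(p, off)
        return err
    }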

The interaction between KeepClient and keepViaHTTP needs to be explained better. I think what's happening is this:

  • KeepClient implements KeepGateway with a disk cache, if a block isn't available in the cache, the disk cache calls keepViaHTTP which calls back to KeepClient

Yes, that's right.

  • The existing KeepClient API doesn't change. It doesn't use the cache.

The existing KeepClient API doesn't change.

PutHR, PutHB, PutB, PutR methods do use the cache. They are now implemented as wrappers around the new API (BlockWrite).

Get method doesn't use the cache because I missed this part. It should be reimplemented as a wrapper around the new API (BlockRead).

(We could skip it and still get most of the benefit of this change, since we're mostly focused on improving WebDAV performance and keep-web doesn't use Get()... but I don't think it will be especially hard so it seems more reasonable to do it anyway.)

  • The KeepGateway API is new, and it does use the cache.

The KeepGateway API is new-ish (the Go SDK "filesystem" code has been using ReadAt and BlockWrite for a long time, but now the interface has a public name)... and it does use the cache.

I guess in a perfect world these would be separate, but in the world we have, it's much easier (and less disruptive) to continue passing around the KeepClient object and migrate users over to the KeepGateway API as needed?

Yes. I think retiring the old interface is a separate adventure.

Updates:

20318-disk-cache @ 8e24aa37b7a2c788dd706013a36da6ca975fb981 -- developer-run-tests: #3987

TODO: implement (*KeepClient)Get() as a wrapper around the new API so it uses the cache.

Actions #25

Updated by Tom Clegg 3 months ago

20318-disk-cache @ cce2c381181ff560bc134845eaea91939e1f8888 -- developer-run-tests: #3995

Enabling disk cache in Get was more effort than I thought, but also helped reveal some (easily fixed) bugs. See commit messages for e349b01188, 295d13d6d0, f9cb9d25da, 3583e494ed

I added a special value for disk cache size (1 byte) that bypasses the cache entirely. This is now used by:
  • several tests that get very confused by a cache (e.g., keep-rsync has a sanity check "if I write a new block only to cluster A, I shouldn't be able to retrieve it from cluster B").
  • keepstore when pulling blocks from other keepstores (we might want to revisit this in future, but for now I don't think we want keepstore to unexpectedly start using disk space for a local block cache that almost never gets hits because it's only used for pulls)
  • keepproxy (using the cache actually makes sense here, but I didn't want to get hung up on it right now -- we might want to go straight to "replace keepproxy and keepstore using new KeepGateway strategy" anyway)
  • keep-rsync (similar reasoning as keepstore)

Also did a small amount of code cleanup in places I visited while debugging.

Actions #26

Updated by Tom Clegg 3 months ago

20318-disk-cache @ f6e6f83268a4665a050d10b4a790906598dc1018 -- developer-run-tests: #3996

Fixes a bug exposed by a test failure in #note-25: the cache would return an error to the caller if a newly written cache file was deleted while it was being written, or immediately after writing but before being read.

Actions #27

Updated by Peter Amstutz 2 months ago

Are these two comments in keep_cache.go still accurate since 295d13d6d07dd2a659d93d026e5a7505cbc42936?

    if err != nil {
        // If the copy-from-backend goroutine encountered an
        // error before copying enough bytes to satisfy our
        // request, we return that error.
    } else {
        // Regardless of whether the copy-from-backend
        // goroutine succeeded, or failed after copying the
        // bytes we need, the only errors we need to report
        // are errors reading from the cache file.
        return sharedf.ReadAt(dst, int64(offset))
    }
Actions #28

Updated by Tom Clegg 2 months ago

Actions #29

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #30

Updated by Tom Clegg 2 months ago

Unrelated Ruby dependency issue fixed:

20318-disk-cache @ 44c2f5790059f3ec0380fbdb659489d8b02831b9 -- developer-run-tests: #4003

Actions #31

Updated by Tom Clegg 2 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #32

Updated by Tom Clegg about 1 month ago

  • Related to Bug #21428: keepclient test StandaloneSuite.TestPutHR fails intermittently added