Bug #22184 (closed)

Consistent WebDAV behavior in load-balanced installation

Added by Tom Clegg 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Keep
Story points: -
Release: -
Release relationship: Auto

Description

Originally this was a flaky test bug report:
  • keep-web IntegrationSuite.TestConcurrentWrites fails intermittently
  • at 9cb3284ca1, ran the test 100x, got 4x failures.

The underlying problem was that (since #21702) keep-web's cache behaves in an eventually-consistent way that is likely to confuse clients (e.g., if a client reads a file soon after updating/replacing that file with new content, the old content might be returned).

Before #21702, this problem only existed in a load-balanced setup -- as long as there was only one keep-web process, writes to a given collection ID were serialized and the cached filesystem was updated after each write, so no request used an outdated view of a collection.

This issue now aims to fix the problem even for load-balanced setups.


Files

22184-log.txt (26.5 KB) Tom Clegg, 10/08/2024 05:36 PM
Arvados_arch.png (145 KB) Tom Clegg, 11/04/2024 09:40 PM

Subtasks 1 (1 open, 0 closed)

Task #22190: Review 22184-concurrent-mkcol (In Progress, Tom Clegg, 11/04/2024)

Related issues 3 (1 open, 2 closed)

Related to Arvados - Feature #21702: keep-web uses replace_files API to add/replace/rename/remove files atomically (Resolved, Tom Clegg, 09/25/2024)
Related to Arvados - Feature #22279: keep-web: eliminate extra PDH lookup per request in series of PUT requests (New, Tom Clegg)
Related to Arvados - Feature #22281: Installer allows keep-web to connect to database (Resolved, Lucas Di Pentima)

#1

Updated by Tom Clegg 2 months ago

  • Status changed from New to In Progress
#2

Updated by Tom Clegg 2 months ago

#3

Updated by Tom Clegg 2 months ago

  • Description updated (diff)

#4

Updated by Peter Amstutz 2 months ago

  • Release set to 70

#5

Updated by Tom Clegg 2 months ago

Clues so far:
  • All observed failures are in the n=8 iteration (lots of goroutines operating on 8 different dirs in the same collection) -- no failures when running concurrent goroutines on 1, 2, or 4 different dirs
  • All observed failures return 409 for "PUT dirN/fileN"
  • The only case where the webdav handler returns 409 for PUT is when the parent directory does not exist
  • The test case has a barrier between MKCOL dirN and PUT dirN/fileN
  • In the attached log, PUT i=0/j=0,1,2,3,4,5,6 fail, then PUT i=0/j=7 succeeds

This seems to indicate that the side effect of MKCOL is reliably preserved, but calling collectionDir.Sync() at the end of MKCOL is not enough to guarantee that the created directory is visible during a future request.

The fact that we only see the problem with n=8 hints that the problem is a race between concurrent collectionDir.Sync() calls for different directories in the same collection.

#6

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint

#7

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2024-10-23 sprint to Development 2024-11-06 sprint

#8

Updated by Tom Clegg about 2 months ago

The test failure is caused by concurrent MKCOL requests:

  • req A uses replace_files API to atomically create dir A
  • req A calls Sync
  • req A Sync loads collection (which now contains dir A but not dir B)
  • req B uses replace_files API to atomically create dir B
  • req B calls Sync
  • req B Sync loads collection (which now contains dirs A and B)
  • req B Sync updates sessionFS with loaded collection, so sessionFS now contains dirs A and B
  • req A Sync updates sessionFS with loaded collection, so sessionFS now contains dir A but not dir B
  • req A returns OK response
  • req B returns OK response

Now the caller who made req B has seen an OK response, so it knows dir B was created successfully. If it then makes another request like "PUT B/b", that request will fail because dir B does not exist in the sessionFS cache.

22184-concurrent-mkcol @ 21c68271ff3ddbac921e98e43229fb669eb66223 -- developer-run-tests: #4544

This branch addresses the issue by (instead of calling Sync after replace_files in an MKCOL req) calling Sync in any request that would fail based on the current sessionFS state but could succeed if Sync loads more recent changes.

Drawbacks:
  • still presents inconsistency issues when existing files are updated/replaced with new content -- even when only a single keep-web service is running (e.g., write A, read A, replace A with new content, wait W, read A → old or new content is returned depending on how long interval W is)
  • attempts to retrieve nonexistent files are now inefficient (Sync is much slower than just checking sessionFS)

However, the conclusion of the team discussion was that keep-web should instead:
  • maintain a connection to the postgres database, and
  • look up uuid→pdh before each collection-level read/write op, and call Sync if it has changed

The lookup should be fast (bypassing controller, rails, permission checks, etc) and can actually guarantee that it takes into account all prior operations that were completed before the current request was started.

#9

Updated by Tom Clegg about 2 months ago

22184-concurrent-mkcol @ 56a00a9fb5148717fe7683261e0dde4d319fda27 -- developer-run-tests: #4547

  • Sync (reload) target collection if PDH has changed. This makes keep-web behave predictably even in a load-balanced setup.
  • Fix handling of concurrent Sync calls (don't apply upstream changes if another Sync has already applied upstream changes that were fetched more recently). Note this alone would be enough to fix the flaky test.

Now each request in a sequence of PUT requests will cause a collection sync/reload. This is not ideal, and there are special cases where we could optimize that out, but for now I'm inclined to keep it simple.

#10

Updated by Tom Clegg about 2 months ago

  • Category changed from Tests to Keep
  • Description updated (diff)
  • Subject changed from "keep-web IntegrationSuite.TestConcurrentWrites fails intermittently" to "Consistent WebDAV behavior in load-balanced installation"

#11

Updated by Tom Clegg about 2 months ago

  • Related to Feature #21702: keep-web uses replace_files API to add/replace/rename/remove files atomically added

#12

Updated by Tom Clegg about 2 months ago

  • Related to Feature #22279: keep-web: eliminate extra PDH lookup per request in series of PUT requests added

#13

Updated by Tom Clegg about 2 months ago

22184-concurrent-mkcol @ e8a6b9e20c482a6c4d9e70f38d765737ba465a39 -- developer-run-tests: #4549

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ Fixes flaky test (tested by running 100x, no failures)
    • ✅ Also fixes undesirable "eventually consistent" behavior in a load-balanced setup
    • ✅ Also fixes SDK bug in collectionFS: the Sys() function used to return the original *Collection used to set up the collectionFS. It now returns a *Collection with the last PDH received from the server when saving/loading changes.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
  • New or changed UX/UI and has gotten feedback from stakeholders.
    • N/A
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • The uuid-pdh check will add a short delay on every request, and a somewhat longer delay when the collection has changed. The worst case is a series of PUT requests. See #note-9 re optimizing this.
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.
#14

Updated by Lucas Di Pentima about 2 months ago

With this change, keep-web becomes a "first-class service" in the deployment & security sense, as it needs direct DB access. Some comments:

  • I guess it was already discussed/taken into account, but just in case: wouldn't this dramatically increase the attack surface for public clusters? Keep-web nodes will need DB firewall access opened, and our setup provides an all-or-nothing approach (e.g., we don't use different DB accounts for limited-access operations)
  • Documentation that I think will need updates
    • Upgrade Notes page: users that deployed their clusters without using our installer will likely need a heads up in case they have strict firewall policies.
    • Architecture page: the 30k-ft graph could show keep-web's direct DB connection.
    • The install/install-keep-web.html page also might need updating, as it says "... It can be installed anywhere with access to Keep services..."
  • Now that keep-web requires an additional set of config params (the DB access credentials, etc.), it may be necessary to update the diagnostics tool to make sure this access pattern works.
  • Also, I think we'll need to review our installer code to reflect this change. That surely can be a separate ticket.
#15

Updated by Tom Clegg about 2 months ago

Time to look up PDH of an example collection on pirca:
  • via controller 24.4 ms ("timeTotal":0.024449)
  • via postgres 0.2 ms (planning time 0.124 ms, execution time 0.065 ms)
keep-web reported time to download a small (~2 MB) file from a cached collection (before this change):
  • total 306 ms
  • timeToStatus 0.3 ms

Based on these numbers, looking up the PDH via controller would increase best-case keep-web request latency from ~0.3 ms to ~25 ms, whereas looking up via direct postgres connection would increase latency from ~0.3 ms to ~0.5 ms. Relatively speaking that's a huge difference, OTOH total latency of 25 ms might not be a deal-breaker.

#16

Updated by Lucas Di Pentima about 2 months ago

I agree with your comment: an ~80x latency increase is huge, but it would only be noticeable when doing lots of small-file requests. OTOH keeping keep-web detached from the DB sounds like a worthy compromise.

#17

Updated by Peter Amstutz about 2 months ago

25ms is more than I was hoping (I was hoping for single-digit milliseconds), but I'm also leaning towards the simpler/safer solution of continuing to use the API.

Two thoughts:

  • We could have keep-web check for database connectivity on startup. If the database is not available, it prints a warning and falls back to using the API.
  • It seems entirely plausible that cutting Rails out of the picture and generating the response directly in controller could get it down to that single-digit millisecond range.
#19

Updated by Tom Clegg about 2 months ago

After some discussion, we decided to keep the new database access requirement and update the documentation (and installers) accordingly, rather than add back ~25 ms of the ~250 ms latency that we eliminated in #2960.

22184-concurrent-mkcol @ 6d82e1134c502e19533d82ad10e9dc9d17c5b946 -- developer-run-tests: #4550

#20

Updated by Lucas Di Pentima about 2 months ago

FYI: tordo's PG config is already fixed to accept connections from keep-web's host.

#21

Updated by Lucas Di Pentima about 2 months ago

With the documentation updates, this LGTM. Thanks!

#22

Updated by Peter Amstutz about 2 months ago

Tom Clegg wrote in #note-19:

To be honest I find the lines swooshing around pretty hard to follow. The original graphic had really simple straight lines between the boxes. The challenge is that the keep-web box is in an awkward place to show it connecting to the database.

I'm fine with merging the branch but I might want to take another pass at tinkering with the graph.

#23

Updated by Lucas Di Pentima about 2 months ago

  • Related to Feature #22281: Installer allows keep-web to connect to database added
#24

Updated by Tom Clegg about 2 months ago

  • Status changed from In Progress to Resolved