Bug #22184
Closed: Consistent WebDAV behavior in load-balanced installation
Description
- keep-web IntegrationSuite.TestConcurrentWrites fails intermittently
- at 9cb3284ca1, ran the test 100x, got 4x failures.
The underlying problem was that (since #21702) keep-web's cache is eventually consistent in a way that is likely to confuse clients (e.g., if a client reads a file soon after updating/replacing that file with new content, the old content might be returned).
Before #21702, this problem only existed in a load-balanced setup -- as long as there was only one keep-web process, writes to a given collection ID were serialized and the cached filesystem was updated after each write, so no request used an outdated view of a collection.
This issue now aims to fix the problem even for load-balanced setups.
Updated by Tom Clegg about 1 month ago
- All observed failures are in the n=8 iteration (lots of goroutines operating on 8 different dirs in the same collection) -- no failures when running concurrent goroutines on 1, 2, or 4 different dirs
- All observed failures return 409 for "PUT dirN/fileN"
- The only case where the webdav handler returns 409 for PUT is when the parent directory does not exist
- The test case has a barrier between MKCOL dirN and PUT dirN/fileN
- In the attached log, PUT i=0/j=0,1,2,3,4,5,6 fail, then PUT i=0/j=7 succeeds
This seems to indicate that the side effect of MKCOL is reliably preserved, but calling collectionDir.Sync() at the end of MKCOL is not enough to guarantee that the created directory is visible during a future request. The fact that we only see the problem with n=8 hints that the problem is a race between concurrent collectionDir.Sync() calls for different directories in the same collection.
Updated by Peter Amstutz about 1 month ago
- Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Updated by Peter Amstutz 29 days ago
- Target version changed from Development 2024-10-23 sprint to Development 2024-11-06 sprint
Updated by Tom Clegg 22 days ago
The test failure is caused by concurrent MKCOL requests:
- req A uses replace_files API to atomically create dir A
- req A calls Sync
- req A Sync loads collection (which now contains dir A but not dir B)
- req B uses replace_files API to atomically create dir B
- req B calls Sync
- req B Sync loads collection (which now contains dirs A and B)
- req B Sync updates sessionFS with loaded collection, so sessionFS now contains dirs A and B
- req A Sync updates sessionFS with loaded collection, so sessionFS now contains dir A but not dir B
- req A returns OK response
- req B returns OK response
Now, the caller who made req B has seen an OK response, so it knows dir B was created successfully. If it then makes another request like "PUT B/b", that request will fail because dir B does not exist in the sessionFS cache.
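A minimal sketch of that interleaving (hypothetical names, not the actual keep-web types): the last Sync to apply its snapshot wins, even if its snapshot was fetched earlier.

```go
package main

import (
	"fmt"
	"sync"
)

type sessionFS struct {
	mu   sync.Mutex
	dirs map[string]bool // cached view of the collection's directories
}

// syncFromSnapshot overwrites the cached view with a snapshot loaded from
// the API server. With no ordering check, whichever request applies its
// snapshot last wins, even if that snapshot was fetched earlier.
func (fs *sessionFS) syncFromSnapshot(snap map[string]bool) {
	fs.mu.Lock()
	defer fs.mu.Unlock()
	fs.dirs = snap
}

func main() {
	fs := &sessionFS{dirs: map[string]bool{}}
	snapA := map[string]bool{"dirA": true}               // req A's snapshot: dir A only
	snapB := map[string]bool{"dirA": true, "dirB": true} // req B's newer snapshot
	fs.syncFromSnapshot(snapB)   // req B's Sync applies first
	fs.syncFromSnapshot(snapA)   // req A's stale Sync applies last, erasing dir B
	fmt.Println(fs.dirs["dirB"]) // false: dir B is gone despite req B's OK response
}
```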
22184-concurrent-mkcol @ 21c68271ff3ddbac921e98e43229fb669eb66223 -- developer-run-tests: #4544
Instead of calling Sync after replace_files in an MKCOL request, this branch calls Sync in any request that would fail based on the current sessionFS state but could succeed if Sync loaded more recent changes.
Drawbacks:
- still presents inconsistency issues when existing files are updated/replaced with new content -- even when only a single keep-web service is running (e.g., write A, read A, replace A with new content, wait W, read A → old or new content is returned depending on how long interval W is)
- attempts to retrieve nonexistent files are now inefficient (Sync is much slower than just checking sessionFS)
However, the conclusion of the team discussion was that keep-web should instead:
- maintain a connection to the postgres database, and
- look up uuid→pdh before each collection-level read/write op, and call Sync if it has changed
The lookup should be fast (bypassing controller, rails, permission checks, etc) and can actually guarantee that it takes into account all prior operations that were completed before the current request was started.
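A rough sketch of that per-request check, assuming a standard database/sql connection and that the collections table exposes uuid and portable_data_hash columns; the query and the checkAndSync helper are illustrative, not the actual keep-web code.

```go
// Package pdhcheck is an illustrative sketch, not part of Arvados.
package pdhcheck

import (
	"context"
	"database/sql"
)

// checkAndSync compares the collection's current portable_data_hash (PDH) in
// postgres with the PDH of the cached filesystem, and reloads only if they
// differ.
func checkAndSync(ctx context.Context, db *sql.DB, uuid, cachedPDH string, sync func() error) error {
	var pdh string
	err := db.QueryRowContext(ctx,
		"SELECT portable_data_hash FROM collections WHERE uuid = $1", uuid).Scan(&pdh)
	if err != nil {
		return err
	}
	if pdh != cachedPDH {
		// Collection changed since the cached filesystem was built; reload
		// before serving this request.
		return sync()
	}
	return nil
}
```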
Updated by Tom Clegg 20 days ago
22184-concurrent-mkcol @ 56a00a9fb5148717fe7683261e0dde4d319fda27 -- developer-run-tests: #4547
- Sync (reload) target collection if PDH has changed. This makes keep-web behave predictably even in a load-balanced setup.
- Fix handling of concurrent Sync calls (don't apply upstream changes if another Sync has already applied upstream changes that were fetched more recently). Note this alone would be enough to fix the flaky test.
Now each request in a sequence of PUT requests will cause a collection sync/reload. This is not ideal, and there are special cases where we could optimize that out, but for now I'm inclined to keep it simple.
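A hedged sketch of the "don't apply stale Sync results" idea, assuming each Sync records when its upstream snapshot was fetched; type and field names are invented and do not match the collectionfs source.

```go
// Package collectionsketch illustrates guarding against out-of-order Sync
// application; it is not the actual implementation.
package collectionsketch

import (
	"sync"
	"time"
)

type cachedCollection struct {
	mu          sync.Mutex
	dirs        map[string]bool
	lastFetched time.Time // upstream fetch time of the currently-applied snapshot
}

// applySnapshot installs a snapshot only if it was fetched more recently than
// the one already applied, so a slow Sync cannot clobber the results of a
// Sync that fetched fresher data.
func (c *cachedCollection) applySnapshot(snap map[string]bool, fetchedAt time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if fetchedAt.Before(c.lastFetched) {
		return // a newer snapshot has already been applied
	}
	c.dirs = snap
	c.lastFetched = fetchedAt
}
```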
Updated by Tom Clegg 20 days ago
- Related to Feature #21702: keep-web uses replace_files API to add/replace/rename/remove files atomically added
Updated by Tom Clegg 17 days ago
- Related to Feature #22279: keep-web: eliminate extra PDH lookup per request in series of PUT requests added
Updated by Tom Clegg 17 days ago
22184-concurrent-mkcol @ e8a6b9e20c482a6c4d9e70f38d765737ba465a39 -- developer-run-tests: #4549
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅ Fixes flaky test (tested by running 100x, no failures)
- ✅ Also fixes undesirable "eventually consistent" behavior in a load-balanced setup
- ✅ Also fixes an SDK bug in collectionFS: the Sys() function used to return the original *Collection used to set up the collectionFS. It now returns a *Collection with the last PDH received from the server when saving/loading changes.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅
- New or changed UX/UI and has gotten feedback from stakeholders.
- N/A
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- The uuid→pdh check will add a short delay on every request, and a somewhat longer delay when the collection has changed. The worst case is a series of PUT requests. See #note-9 re optimizing this.
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- ✅
Updated by Lucas Di Pentima 17 days ago
With this change, keep-web becomes a "first-class service" in the deployment & security sense, as it needs direct DB access. Some comments:
- I guess it was already discussed/taken into account, but just in case: wouldn't this increase the attack surface dramatically for public clusters? Keep-web nodes will need firewall DB access opened, and our setup provides an all-or-nothing approach (eg: we don't use different DB accounts for limited access operations)
- Documentation that I think will need updates
- Upgrade Notes page: users that deployed their clusters without using our installer will likely need a heads up in case they have strict firewall policies.
- Architecture page: the 30k ft graph could show keep-web's direct DB connection.
- The install/install-keep-web.html page also might need updating, as it says "... It can be installed anywhere with access to Keep services..."
- Now that keep-web requires an additional set of config params (the DB access credentials, etc.), it may be necessary to update the diagnostics tool to make sure this access pattern works?
- Also, I think we'll need to review our installer code to reflect this change. That surely can be a separate ticket.
Updated by Tom Clegg 17 days ago
- via controller 24.4 ms ("timeTotal":0.024449)
- via postgres 0.2 ms (planning time 0.124 ms, execution time 0.065 ms)
- total 306 ms
- timeToStatus 0.3 ms
Based on these numbers, looking up the PDH via controller would increase best-case keep-web request latency from ~0.3 ms to ~25 ms, whereas looking up via direct postgres connection would increase latency from ~0.3 ms to ~0.5 ms. Relatively speaking that's a huge difference, OTOH total latency of 25 ms might not be a deal-breaker.
Updated by Lucas Di Pentima 17 days ago
I agree with your comment: an ~80x latency increase is huge, but it would only be noticeable when doing lots of small file requests. OTOH, keeping keep-web detached from the DB sounds like a worthy compromise.
Updated by Peter Amstutz 17 days ago
25ms is more than I was hoping (I was hoping for single-digit milliseconds), but I'm also leaning towards the simpler/safer solution of continuing to use the API.
Two thoughts:
- We could have keep-web check for database connectivity on startup. If the database is not available, it prints a warning and falls back to using the API (a rough sketch follows this list).
- It seems entirely plausible that cutting Rails out of the picture and generating the response directly in controller could get it down to that single-digit millisecond range.
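A hedged sketch of the startup-probe idea above (not implemented in this branch); the package and function names are hypothetical, not actual keep-web configuration.

```go
// Package fallbacksketch illustrates a startup DB probe with API fallback.
package fallbacksketch

import (
	"context"
	"database/sql"
	"log"
)

// usePostgresLookup reports whether keep-web should do PDH lookups via a
// direct postgres connection; if the database is unreachable it logs a
// warning and the caller falls back to looking up PDHs via the API.
func usePostgresLookup(ctx context.Context, db *sql.DB) bool {
	if err := db.PingContext(ctx); err != nil {
		log.Printf("warning: database unreachable (%v); falling back to controller API for PDH lookups", err)
		return false
	}
	return true
}
```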
Updated by Tom Clegg 17 days ago
- File Arvados_arch.png added
After some discussion, we decided to keep the new database access requirement and update the documentation (and installers) accordingly, rather than add back ~25 ms of the ~250 ms latency that we eliminated in #2960.
22184-concurrent-mkcol @ 6d82e1134c502e19533d82ad10e9dc9d17c5b946 -- developer-run-tests: #4550
Updated by Lucas Di Pentima 17 days ago
FYI: tordo's PG config is already fixed to accept connections from keep-web's host.
Updated by Lucas Di Pentima 17 days ago
With the documentation updates, this LGTM. Thanks!
Updated by Peter Amstutz 17 days ago
Tom Clegg wrote in #note-19:
To be honest I find the lines swooshing around pretty hard to follow. The original graphic had really simple straight lines between the boxes. The challenge is that the keep-web box is in an awkward place to show it connecting to the database.
I'm fine with merging the branch but I might want to take another pass at tinkering with the graph.
Updated by Lucas Di Pentima 17 days ago
- Related to Feature #22281: Installer allows keep-web to connect to database added
Updated by Tom Clegg 17 days ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|1a8dbed0ba6f4f2bf1222bce481ded309a78d243.