Bug #16133
closed[controller] add loop prevention to federation lookups in new code path
Added by Ward Vandewege almost 5 years ago. Updated 7 months ago.
Description
The new controller code is missing loop detection for block lookups on federation peers.
Related issues
Updated by Tom Clegg almost 5 years ago
- Assigned To set to Tom Clegg
- Status changed from New to In Progress
16133-federation-loop @ 14bb7aa24ed7a06ce94e59f64ab32bdb44641168 -- developer-run-tests: #1717
Updated by Peter Amstutz almost 5 years ago
The old strategy was that controller wouldn't forward any request that said it was from another instance of controller. Is there a reason not to do that any more?
The new strategy is that specifically for collection GET requests, it adds a new query parameter that tracks which clusters the query has passed through - in practice this is only ever the originating cluster, because it stops propagating the query after that.
I'm a little confused this isn't a more general solution, if every federated method needs to separately account for loops that seems error prone.
Updated by Tom Clegg almost 5 years ago
The HTTP header seems like the wrong layer (if we use a transport other than HTTP, we still want to detect cycles). It changes the meaning of the request so we'd probably need a "Vary" header to do it right. Putting a trail in the request options just seems more explicit and easier, and it can distinguish cycles from multi-hop.
"Never makes sense to take multiple hops" is specific to collections, because of blob signatures. OTOH we also want to avoid taking multiple paths to a single destination, like a→b→d and a→c→d... which isn't specific to collections. So (at least until we need / can handle the multi-hop case) this might as well be handled in tryLocalThenRemotes.
Most federation methods don't need to detect loops because the upstream requests only have UUIDs that match their next hop.
16133-federation-loop @ d8861941ef704dd729252ee1592e48a082798b65 -- developer-run-tests: #1719
Updated by Ward Vandewege almost 5 years ago
I built the binary from source and am testing it on 9tee4.
Both 9tee4 and c97qk are configured in a 3-way federation with 4xphq. The latter is powered off. c97qk is running controller version 1.5.0.pre20200115150419-1 (it is no longer auto-deployed).
9tee4 is running controller from the tip of the 16133-federation-loop branch at 14bb7aa24ed7a06ce94e59f64ab32bdb44641168
Searching for the block 9f26a86b6030a69ad222cf67d71c9502+65 (this is the one that caused the loop last night) on workbench.9tee4 eventually gave me a fiddlesticks with this error:
{ ":errors":[ "errors: [request failed: https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F30c99ba7191be91634d1439eab1ee4f969d12699%22%5D: 404 Not Found: not found request failed: https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F2938ad3eb7eb5c1ff6b5bb868ac90cc8d54d06d4%22%5D: 401 Unauthorized: request failed: http://localhost:8000/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F2938ad3eb7eb5c1ff6b5bb868ac90cc8d54d06d4%22%5D: 401 Unauthorized: Not logged in (req-2yf2393hw25yh35g10uo) Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2Fff9d848d98373dd58e3cecad335d52d4765d580f%22%5D: dial tcp 54.209.184.185:443: i/o timeout]" ] }
This IP in the error message, 54.209.184.185, is 4xphq's api server IP. Cf. #16134.
I no longer see a loop looking for that block, so I think this patch works. There is no excessive use of file descriptors either.
Updated by Ward Vandewege almost 5 years ago
- Related to Bug #16134: [controller] handle unreachable federation peer better added
Updated by Peter Amstutz almost 5 years ago
Tom Clegg wrote:
The HTTP header seems like the wrong layer (if we use a transport other than HTTP, we still want to detect cycles). It changes the meaning of the request so we'd probably need a "Vary" header to do it right. Putting a trail in the request options just seems more explicit and easier, and it can distinguish cycles from multi-hop.
Got it, thanks for explaining the rationale.
"Never makes sense to take multiple hops" is specific to collections, because of blob signatures. OTOH we also want to avoid taking multiple paths to a single destination, like a→b→d and a→c→d... which isn't specific to collections. So (at least until we need / can handle the multi-hop case) this might as well be handled in tryLocalThenRemotes.
Most federation methods don't need to detect loops because the upstream requests only have UUIDs that match their next hop.
Right, it's the only one out of the current set of supported federation requests, but in the future I expect we'll have other open-ended queries that don't feature UUIDs directing them to a specific cluster.
Putting the check in tryLocalThenRemotes() does also have the minor benefit of not doing the extra work of starting, immediately failing, and gathering results from N extra goroutines.
16133-federation-loop @ d8861941ef704dd729252ee1592e48a082798b65 -- developer-run-tests: #1719
This LGTM.
Updated by Anonymous almost 5 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|118908c39c6ffa0ae8b62cddbdb610c51a461b6d.