Project

General

Profile

Actions

Bug #16133

closed

[controller] add loop prevention to federation lookups in new code path

Added by Ward Vandewege about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-

Description

The new controller code is missing loop detection for block lookups on federation peers.


Subtasks 1 (0 open1 closed)

Task #16135: Review 16133-federation-loopResolvedPeter Amstutz02/06/2020Actions

Related issues

Related to Arvados - Bug #16134: [controller] handle unreachable federation peer betterResolvedActions
Actions #1

Updated by Ward Vandewege about 4 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg about 4 years ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #3

Updated by Peter Amstutz about 4 years ago

The old strategy was that controller wouldn't forward any request that said it was from another instance of controller. Is there a reason not to do that any more?

The new strategy is that specifically for collection GET requests, it adds a new query parameter that tracks which clusters the query has passed through - in practice this is only ever the originating cluster, because it stops propagating the query after that.

I'm a little confused this isn't a more general solution, if every federated method needs to separately account for loops that seems error prone.

Actions #4

Updated by Tom Clegg about 4 years ago

The HTTP header seems like the wrong layer (if we use a transport other than HTTP, we still want to detect cycles). It changes the meaning of the request so we'd probably need a "Vary" header to do it right. Putting a trail in the request options just seems more explicit and easier, and it can distinguish cycles from multi-hop.

"Never makes sense to take multiple hops" is specific to collections, because of blob signatures. OTOH we also want to avoid taking multiple paths to a single destination, like a→b→d and a→c→d... which isn't specific to collections. So (at least until we need / can handle the multi-hop case) this might as well be handled in tryLocalThenRemotes.

Most federation methods don't need to detect loops because the upstream requests only have UUIDs that match their next hop.

16133-federation-loop @ d8861941ef704dd729252ee1592e48a082798b65 -- developer-run-tests: #1719

Actions #5

Updated by Ward Vandewege about 4 years ago

I built the binary from source and am testing it on 9tee4.

Both 9tee4 and c97qk are configured in a 3-way federation with 4xphq. The latter is powered off. c97qk is running controller version 1.5.0.pre20200115150419-1 (it is no longer auto-deployed).

9tee4 is running controller from the tip of the 16133-federation-loop branch at 14bb7aa24ed7a06ce94e59f64ab32bdb44641168

Searching for the block 9f26a86b6030a69ad222cf67d71c9502+65 (this is the one that caused the loop last night) on workbench.9tee4 eventually gave me a fiddlesticks with this error:

{
  ":errors":[
    "errors: [request failed: https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F30c99ba7191be91634d1439eab1ee4f969d12699%22%5D: 404 Not Found: not found request failed: https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F2938ad3eb7eb5c1ff6b5bb868ac90cc8d54d06d4%22%5D: 401 Unauthorized: request failed: http://localhost:8000/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F2938ad3eb7eb5c1ff6b5bb868ac90cc8d54d06d4%22%5D: 401 Unauthorized: Not logged in (req-2yf2393hw25yh35g10uo) Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2Fff9d848d98373dd58e3cecad335d52d4765d580f%22%5D: dial tcp 54.209.184.185:443: i/o timeout]" 
  ]
}

This IP in the error message, 54.209.184.185, is 4xphq's api server IP. Cf. #16134.

I no longer see a loop looking for that block, so I think this patch works. There is no excessive use of file descriptors either.

Actions #6

Updated by Ward Vandewege about 4 years ago

  • Related to Bug #16134: [controller] handle unreachable federation peer better added
Actions #7

Updated by Peter Amstutz about 4 years ago

Tom Clegg wrote:

The HTTP header seems like the wrong layer (if we use a transport other than HTTP, we still want to detect cycles). It changes the meaning of the request so we'd probably need a "Vary" header to do it right. Putting a trail in the request options just seems more explicit and easier, and it can distinguish cycles from multi-hop.

Got it, thanks for explaining the rationale.

"Never makes sense to take multiple hops" is specific to collections, because of blob signatures. OTOH we also want to avoid taking multiple paths to a single destination, like a→b→d and a→c→d... which isn't specific to collections. So (at least until we need / can handle the multi-hop case) this might as well be handled in tryLocalThenRemotes.

Most federation methods don't need to detect loops because the upstream requests only have UUIDs that match their next hop.

Right, it's the only one out of the current set of supported federation requests, but in the future I expect we'll have other open-ended queries that don't feature UUIDs directing them to a specific cluster.

Putting the check in tryLocalThenRemotes() does also have the minor benefit of not doing the extra work of starting, immediately failing, and gathering results from N extra goroutines.

16133-federation-loop @ d8861941ef704dd729252ee1592e48a082798b65 -- developer-run-tests: #1719

This LGTM.

Actions #8

Updated by Anonymous about 4 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF