Bug #16133

[controller] add loop prevention to federation lookups in new code path

Added by Ward Vandewege 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
02/06/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

The new controller code is missing loop detection for block lookups on federation peers.


Subtasks

Task #16135: Review 16133-federation-loopResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #16134: [controller] handle unreachable federation peer betterResolved

Associated revisions

Revision 118908c3
Added by Tom Clegg 5 months ago

Merge branch '16133-federation-loop'

fixes #16133

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Ward Vandewege 5 months ago

  • Description updated (diff)

#2 Updated by Tom Clegg 5 months ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress

#3 Updated by Peter Amstutz 5 months ago

The old strategy was that controller wouldn't forward any request that said it was from another instance of controller. Is there a reason not to do that any more?

The new strategy is that specifically for collection GET requests, it adds a new query parameter that tracks which clusters the query has passed through - in practice this is only ever the originating cluster, because it stops propagating the query after that.

I'm a little confused this isn't a more general solution, if every federated method needs to separately account for loops that seems error prone.

#4 Updated by Tom Clegg 5 months ago

The HTTP header seems like the wrong layer (if we use a transport other than HTTP, we still want to detect cycles). It changes the meaning of the request so we'd probably need a "Vary" header to do it right. Putting a trail in the request options just seems more explicit and easier, and it can distinguish cycles from multi-hop.

"Never makes sense to take multiple hops" is specific to collections, because of blob signatures. OTOH we also want to avoid taking multiple paths to a single destination, like a→b→d and a→c→d... which isn't specific to collections. So (at least until we need / can handle the multi-hop case) this might as well be handled in tryLocalThenRemotes.

Most federation methods don't need to detect loops because the upstream requests only have UUIDs that match their next hop.

16133-federation-loop @ d8861941ef704dd729252ee1592e48a082798b65 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1719/

#5 Updated by Ward Vandewege 5 months ago

I built the binary from source and am testing it on 9tee4.

Both 9tee4 and c97qk are configured in a 3-way federation with 4xphq. The latter is powered off. c97qk is running controller version 1.5.0.pre20200115150419-1 (it is no longer auto-deployed).

9tee4 is running controller from the tip of the 16133-federation-loop branch at 14bb7aa24ed7a06ce94e59f64ab32bdb44641168

Searching for the block 9f26a86b6030a69ad222cf67d71c9502+65 (this is the one that caused the loop last night) on workbench.9tee4 eventually gave me a fiddlesticks with this error:

{
  ":errors":[
    "errors: [request failed: https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F30c99ba7191be91634d1439eab1ee4f969d12699%22%5D: 404 Not Found: not found request failed: https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F2938ad3eb7eb5c1ff6b5bb868ac90cc8d54d06d4%22%5D: 401 Unauthorized: request failed: http://localhost:8000/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2F2938ad3eb7eb5c1ff6b5bb868ac90cc8d54d06d4%22%5D: 401 Unauthorized: Not logged in (req-2yf2393hw25yh35g10uo) Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65?forwarded_for=9tee4-&reader_tokens=%5B%22v2%2F9tee4-gj3su-dbww194nbb2pcwb%2Fff9d848d98373dd58e3cecad335d52d4765d580f%22%5D: dial tcp 54.209.184.185:443: i/o timeout]" 
  ]
}

This IP in the error message, 54.209.184.185, is 4xphq's api server IP. Cf. #16134.

I no longer see a loop looking for that block, so I think this patch works. There is no excessive use of file descriptors either.

#6 Updated by Ward Vandewege 5 months ago

  • Related to Bug #16134: [controller] handle unreachable federation peer better added

#7 Updated by Peter Amstutz 5 months ago

Tom Clegg wrote:

The HTTP header seems like the wrong layer (if we use a transport other than HTTP, we still want to detect cycles). It changes the meaning of the request so we'd probably need a "Vary" header to do it right. Putting a trail in the request options just seems more explicit and easier, and it can distinguish cycles from multi-hop.

Got it, thanks for explaining the rationale.

"Never makes sense to take multiple hops" is specific to collections, because of blob signatures. OTOH we also want to avoid taking multiple paths to a single destination, like a→b→d and a→c→d... which isn't specific to collections. So (at least until we need / can handle the multi-hop case) this might as well be handled in tryLocalThenRemotes.

Most federation methods don't need to detect loops because the upstream requests only have UUIDs that match their next hop.

Right, it's the only one out of the current set of supported federation requests, but in the future I expect we'll have other open-ended queries that don't feature UUIDs directing them to a specific cluster.

Putting the check in tryLocalThenRemotes() does also have the minor benefit of not doing the extra work of starting, immediately failing, and gathering results from N extra goroutines.

16133-federation-loop @ d8861941ef704dd729252ee1592e48a082798b65 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1719/

This LGTM.

#8 Updated by Anonymous 5 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Also available in: Atom PDF