Project

General

Profile

Actions

Bug #14458

closed

[controller] collection federation panic send on closed channel

Added by Peter Amstutz over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

2018-11-07_17:33:16.37829 {"RequestID":"req-tivf25vy461cnxx5bep7","level":"info","msg":"response","remoteAddr":"127.0.0.1:49982","reqBytes":0,"reqForwardedFor":"172.17.0.6","reqHost":"172.17.0.6:8000","reqMethod":"GET","reqPath":"arvados/v1/collections/cba47aefe5eb3a014a26ec00316b30c1+57","reqQuery":"alt=json","respBytes":239,"respStatus":"OK","respStatusCode":200,"time":"2018-11-07T17:33:16.377898690Z","timeToStatus":0.258516,"timeTotal":0.258538,"timeWriteBody":0.000022}
2018-11-07_17:33:16.38499 panic: send on closed channel
2018-11-07_17:33:16.38501 
2018-11-07_17:33:16.38502 goroutine 251 [running]:
2018-11-07_17:33:16.38502 git.curoverse.com/arvados.git/lib/controller.(*collectionFederatedRequestHandler).ServeHTTP.func1(0xc4200231a0, 0xc4201a6230, 0x8a4bc0, 0xc42022a4c0, 0xc420216040, 0xc420196070, 0xc420360240, 0xc4200283d8, 0x23, 0xc4203601e0, ...)
2018-11-07_17:33:16.38502       /var/lib/gopath/src/git.curoverse.com/arvados.git/lib/controller/fed_collections.go:249 +0x44d
2018-11-07_17:33:16.38503 created by git.curoverse.com/arvados.git/lib/controller.(*collectionFederatedRequestHandler).ServeHTTP
2018-11-07_17:33:16.38503       /var/lib/gopath/src/git.curoverse.com/arvados.git/lib/controller/fed_collections.go:230 +0x4dd

Subtasks 1 (0 open1 closed)

Task #14460: Review 14458-controller-panicResolvedPeter Amstutz11/07/2018Actions

Related issues

Related to Arvados - Feature #14198: [CWL] run steps on remote clustersResolvedPeter Amstutz11/01/2018Actions
Actions #1

Updated by Peter Amstutz over 5 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 5 years ago

Actions #4

Updated by Peter Amstutz over 5 years ago

  • Assigned To set to Peter Amstutz
Actions #5

Updated by Peter Amstutz over 5 years ago

Actions #6

Updated by Tom Clegg over 5 years ago

With a buffered errorChan, it seems like wg.Done() and cancelFunc() can be called after the error is sent but before <-errorChan wins the select race, which means <-sharedContext.Done() can win the select race while the errors slice is empty, even though there is in fact an error to report.

How about making errorChan:=make(chan error, len(h.handler.Cluster.RemoteClusters)) and, when Done(), collecting errors in a for len(errorChan)>0 loop? Seems like we only care about the errors if we get to the "give up" case, so in the meantime accumulating them in a channel seems more convenient than accumulating them in a slice. #14262#note-12

Actions #7

Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

With a buffered errorChan, it seems like wg.Done() and cancelFunc() can be called after the error is sent but before <-errorChan wins the select race, which means <-sharedContext.Done() can win the select race while the errors slice is empty, even though there is in fact an error to report.

Here's my reasoning:

  1. wg.Wait() unblocks to call cancelFunc() only after all goroutines have completed.
  2. At that point, every goroutine that is going to produce an error will have already put its error in the buffered channel
  3. In the loop, the receive on the error channel has priority over the cancellation channel
  4. Therefore, the error channel will be drained before matching the <-sharedContext.Done() case

... so on further reading, I see that the select statement in Go don't actually have priority between channels, so it won't work like that. Argh!

(So the real reason I originally wrote this code using mutexes and shared state is that it was frankly easier for me to reason about than Go channels and all their edge cases)

You're right, it is simpler to keep them in the buffered channel and drain it when it is done.

Now 14458-controller-panic @ 6938e8cff1632d597cfbd333e4d5176805b628c6

Actions #8

Updated by Tom Clegg over 5 years ago

With for range errChan, the behavior changes a bit: if the parent context cancels, the handler blocks until all of the remote-req goroutines notice that and return. (Previously, and still in the success case, the handler returns immediately, letting the remote-req goroutines run until they notice the cancelled context.) Is this deliberate? I suggested "for len(errChan)>0" because I thought the "return immediately, let goroutines run" strategy seemed sensible to use across the board -- e.g., the "request finished" log message would happen right away.

LGTM

Actions #9

Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

With for range errChan, the behavior changes a bit: if the parent context cancels, the handler blocks until all of the remote-req goroutines notice that and return. (Previously, and still in the success case, the handler returns immediately, letting the remote-req goroutines run until they notice the cancelled context.) Is this deliberate? I suggested "for len(errChan)>0" because I thought the "return immediately, let goroutines run" strategy seemed sensible to use across the board -- e.g., the "request finished" log message would happen right away.

No, that wasn't deliberate, I wasn't thinking about the case of the parent context being cancelled. I see what you mean, draining the buffer but not waiting for it to close would allow it to return slightly earlier if there are outstanding requests.

LGTM

Changed range errorChan to len(errorChan) > 0 and merged.

Actions #10

Updated by Peter Amstutz over 5 years ago

  • Status changed from In Progress to Resolved
Actions #11

Updated by Tom Morris over 5 years ago

  • Release set to 14
Actions

Also available in: Atom PDF