Bug #8086
closed[Workbench] Running job shows utilization graph but no logs appear
Description
Why does the graph show and not the log?
Files
Related issues
Updated by Brett Smith over 8 years ago
- Subject changed from Live view of job log does not show up to [Workbench] Running job shows utilization graph but no logs appear
Bryan,
What cluster did this happen on, and how often do you see it happening?
The utilization graph and the live log make separate Websockets connections to the API server. Your screenshot could happen if the Websockets connection for the live log timed out—in which case, this would be a duplicate of #4706.
Updated by Bryan Cosca over 8 years ago
I see it happening on qr1hi. I think pretty often? I just checked a job running now and the same thing is happening.
Updated by Brett Smith over 8 years ago
I'm also seeing this on wx7k5, which is running the same version of Workbench as qr1hi (dae3f71). Firefox console reports, "The connection to wss://ws.wx7k5.arvadosapi.com/websocket?api_token=… was interrupted while the page was loading."
Updated by Brett Smith over 8 years ago
Not seeing this on 4xphq or tb05z. tb05z is also pinned to the same Workbench version, so that suggests that the deployment environment is an issue.
My previous comment about two separate Websockets connections is a mistake: our JavaScript makes a single connection to serve all listeners.
Updated by Brett Smith over 8 years ago
I got a tcpdump starting from loading the Log tab fresh. The websockets connection gets established:
17:00:28.941201 IP locke.51857 > 137.116.76.154.https: Flags [S], seq 2063667638, win 29200, o ptions [mss 1460,sackOK,TS val 13944598 ecr 0,nop,wscale 7], length 0 17:00:28.966265 IP 137.116.76.154.https > locke.51857: Flags [S.], seq 2864683906, ack 2063667 639, win 28960, options [mss 1440,sackOK,TS val 604428251 ecr 13944598,nop,wscale 7], length 0 17:00:28.966303 IP locke.51857 > 137.116.76.154.https: Flags [.], ack 1, win 229, options [nop,nop,TS val 13944604 ecr 604428251], length 0 17:00:28.966567 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1:518, ack 1, win 229, options [nop,nop,TS val 13944604 ecr 604428251], length 517 17:00:28.997833 IP 137.116.76.154.https > locke.51857: Flags [.], ack 518, win 235, options [nop,nop,TS val 604428259 ecr 13944604], length 0 17:00:28.999185 IP 137.116.76.154.https > locke.51857: Flags [P.], seq 1:151, ack 518, win 235, options [nop,nop,TS val 604428260 ecr 13944604], length 150 17:00:28.999249 IP locke.51857 > 137.116.76.154.https: Flags [.], ack 151, win 237, options [nop,nop,TS val 13944612 ecr 604428260], length 0 17:00:28.999729 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 518:605, ack 151, win 237, options [nop,nop,TS val 13944612 ecr 604428260], length 87 17:00:29.000015 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 605:1260, ack 151, win 237, options [nop,nop,TS val 13944612 ecr 604428260], length 655 17:00:29.030082 IP 137.116.76.154.https > locke.51857: Flags [.], ack 1260, win 245, options [nop,nop,TS val 604428267 ecr 13944612], length 0 17:00:29.359241 IP 137.116.76.154.https > locke.51857: Flags [P.], seq 151:367, ack 1260, win 245, options [nop,nop,TS val 604428350 ecr 13944612], length 216 17:00:29.360285 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1260:1317, ack 367, win 245, options [nop,nop,TS val 13944702 ecr 604428350], length 57 17:00:29.422912 IP 137.116.76.154.https > locke.51857: Flags [.], ack 1317, win 245, options [nop,nop,TS val 604428366 ecr 13944702], length 0
Then the line goes quiet. Finally, this happens, at the same time that we get the console message noted above:
17:05:53.970031 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14025855 ecr 604428366], length 39 17:05:54.200739 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14025913 ecr 604428366], length 39 17:05:54.432743 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14025971 ecr 604428366], length 39 17:05:54.896665 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14026087 ecr 604428366], length 39 17:05:55.824747 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14026319 ecr 604428366], length 39 17:05:57.684750 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14026784 ecr 604428366], length 39 17:06:01.404659 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14027714 ecr 604428366], length 39 17:06:03.970985 IP locke.51857 > 137.116.76.154.https: Flags [FP.], seq 1356:1387, ack 367, win 245, options [nop,nop,TS val 14028355 ecr 604428366], length 31 17:06:08.852773 IP locke.51857 > 137.116.76.154.https: Flags [FP.], seq 1317:1387, ack 367, win 245, options [nop,nop,TS val 14029576 ecr 604428366], length 70 17:06:23.732762 IP locke.51857 > 137.116.76.154.https: Flags [FP.], seq 1317:1387, ack 367, win 245, options [nop,nop,TS val 14033296 ecr 604428366], length 70
Updated by Brett Smith over 8 years ago
I had a hunch that this was related to the short-lived SSH connections issue, but that tcpdump looks pretty different. There, the client closes the connection:
17:28:11.951485 IP 137.116.73.38.2222 > locke.48273: Flags [P.], seq 8144:8288, ack 8417, win 517, options [nop,nop,TS val 1691306809 ecr 14356619], length 144 17:28:11.951593 IP locke.48273 > 137.116.73.38.2222: Flags [.], ack 8288, win 483, options [nop,nop,TS val 14360350 ecr 1691306809], length 0 17:28:11.951821 IP locke.48273 > 137.116.73.38.2222: Flags [P.], seq 8417:8445, ack 8288, win 483, options [nop,nop,TS val 14360350 ecr 1691306809], length 28 17:28:11.951921 IP locke.48273 > 137.116.73.38.2222: Flags [FP.], seq 8445:8505, ack 8288, win 483, options [nop,nop,TS val 14360350 ecr 1691306809], length 60 17:28:11.989001 IP 137.116.73.38.2222 > locke.48273: Flags [.], ack 8506, win 517, options [nop,nop,TS val 1691306819 ecr 14360350], length 0 17:28:11.992123 IP 137.116.73.38.2222 > locke.48273: Flags [F.], seq 8288, ack 8506, win 517, options [nop,nop,TS val 1691306819 ecr 14360350], length 0 17:28:11.992232 IP locke.48273 > 137.116.73.38.2222: Flags [.], ack 8289, win 483, options [nop,nop,TS val 14360360 ecr 1691306819], length 0
In retrospect, it's hard to make an apples-to-apples connection, since SSH involves both an extra-cluster and intra-cluster connection working in tandem.
Updated by Brett Smith over 8 years ago
Possible explanations/fixes:
- Send a noop packet regularly to make the connection look even more alive, or detect dead connections quicker.
- The clouds that are misbehaving also run Ubuntu instead of Debian. Maybe there's a difference in Puma performance or connection limits on Ubuntu. Maybe check server logs for issues there?