Project

General

Profile

Actions

Bug #8086

closed

[Workbench] Running job shows utilization graph but no logs appear

Added by Bryan Cosca over 8 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

Why does the graph show and not the log?


Files

livelog.png (80.7 KB) livelog.png Bryan Cosca, 12/29/2015 04:25 PM

Related issues

Has duplicate Arvados - Bug #5937: Log graph shoes up with no logDuplicate05/07/2015Actions
Actions #1

Updated by Brett Smith over 8 years ago

  • Subject changed from Live view of job log does not show up to [Workbench] Running job shows utilization graph but no logs appear

Bryan,

What cluster did this happen on, and how often do you see it happening?

The utilization graph and the live log make separate Websockets connections to the API server. Your screenshot could happen if the Websockets connection for the live log timed out—in which case, this would be a duplicate of #4706.

Actions #2

Updated by Bryan Cosca over 8 years ago

I see it happening on qr1hi. I think pretty often? I just checked a job running now and the same thing is happening.

Actions #3

Updated by Brett Smith over 8 years ago

I'm also seeing this on wx7k5, which is running the same version of Workbench as qr1hi (dae3f71). Firefox console reports, "The connection to wss://ws.wx7k5.arvadosapi.com/websocket?api_token=… was interrupted while the page was loading."

Actions #4

Updated by Brett Smith over 8 years ago

Not seeing this on 4xphq or tb05z. tb05z is also pinned to the same Workbench version, so that suggests that the deployment environment is an issue.

My previous comment about two separate Websockets connections is a mistake: our JavaScript makes a single connection to serve all listeners.

Actions #5

Updated by Brett Smith over 8 years ago

I got a tcpdump starting from loading the Log tab fresh. The websockets connection gets established:

17:00:28.941201 IP locke.51857 > 137.116.76.154.https: Flags [S], seq 2063667638, win 29200, o
ptions [mss 1460,sackOK,TS val 13944598 ecr 0,nop,wscale 7], length 0
17:00:28.966265 IP 137.116.76.154.https > locke.51857: Flags [S.], seq 2864683906, ack 2063667
639, win 28960, options [mss 1440,sackOK,TS val 604428251 ecr 13944598,nop,wscale 7], length 0
17:00:28.966303 IP locke.51857 > 137.116.76.154.https: Flags [.], ack 1, win 229, options [nop,nop,TS val 13944604 ecr 604428251], length 0
17:00:28.966567 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1:518, ack 1, win 229, options [nop,nop,TS val 13944604 ecr 604428251], length 517
17:00:28.997833 IP 137.116.76.154.https > locke.51857: Flags [.], ack 518, win 235, options [nop,nop,TS val 604428259 ecr 13944604], length 0
17:00:28.999185 IP 137.116.76.154.https > locke.51857: Flags [P.], seq 1:151, ack 518, win 235, options [nop,nop,TS val 604428260 ecr 13944604], length 150
17:00:28.999249 IP locke.51857 > 137.116.76.154.https: Flags [.], ack 151, win 237, options [nop,nop,TS val 13944612 ecr 604428260], length 0
17:00:28.999729 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 518:605, ack 151, win 237, options [nop,nop,TS val 13944612 ecr 604428260], length 87
17:00:29.000015 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 605:1260, ack 151, win 237, options [nop,nop,TS val 13944612 ecr 604428260], length 655
17:00:29.030082 IP 137.116.76.154.https > locke.51857: Flags [.], ack 1260, win 245, options [nop,nop,TS val 604428267 ecr 13944612], length 0
17:00:29.359241 IP 137.116.76.154.https > locke.51857: Flags [P.], seq 151:367, ack 1260, win 245, options [nop,nop,TS val 604428350 ecr 13944612], length 216
17:00:29.360285 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1260:1317, ack 367, win 245, options [nop,nop,TS val 13944702 ecr 604428350], length 57
17:00:29.422912 IP 137.116.76.154.https > locke.51857: Flags [.], ack 1317, win 245, options [nop,nop,TS val 604428366 ecr 13944702], length 0

Then the line goes quiet. Finally, this happens, at the same time that we get the console message noted above:

17:05:53.970031 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14025855 ecr 604428366], length 39
17:05:54.200739 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14025913 ecr 604428366], length 39
17:05:54.432743 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14025971 ecr 604428366], length 39
17:05:54.896665 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14026087 ecr 604428366], length 39
17:05:55.824747 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14026319 ecr 604428366], length 39
17:05:57.684750 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14026784 ecr 604428366], length 39
17:06:01.404659 IP locke.51857 > 137.116.76.154.https: Flags [P.], seq 1317:1356, ack 367, win 245, options [nop,nop,TS val 14027714 ecr 604428366], length 39
17:06:03.970985 IP locke.51857 > 137.116.76.154.https: Flags [FP.], seq 1356:1387, ack 367, win 245, options [nop,nop,TS val 14028355 ecr 604428366], length 31
17:06:08.852773 IP locke.51857 > 137.116.76.154.https: Flags [FP.], seq 1317:1387, ack 367, win 245, options [nop,nop,TS val 14029576 ecr 604428366], length 70
17:06:23.732762 IP locke.51857 > 137.116.76.154.https: Flags [FP.], seq 1317:1387, ack 367, win 245, options [nop,nop,TS val 14033296 ecr 604428366], length 70
Actions #6

Updated by Brett Smith over 8 years ago

I had a hunch that this was related to the short-lived SSH connections issue, but that tcpdump looks pretty different. There, the client closes the connection:

17:28:11.951485 IP 137.116.73.38.2222 > locke.48273: Flags [P.], seq 8144:8288, ack 8417, win 517, options [nop,nop,TS val 1691306809 ecr 14356619], length 144
17:28:11.951593 IP locke.48273 > 137.116.73.38.2222: Flags [.], ack 8288, win 483, options [nop,nop,TS val 14360350 ecr 1691306809], length 0
17:28:11.951821 IP locke.48273 > 137.116.73.38.2222: Flags [P.], seq 8417:8445, ack 8288, win 483, options [nop,nop,TS val 14360350 ecr 1691306809], length 28
17:28:11.951921 IP locke.48273 > 137.116.73.38.2222: Flags [FP.], seq 8445:8505, ack 8288, win 483, options [nop,nop,TS val 14360350 ecr 1691306809], length 60
17:28:11.989001 IP 137.116.73.38.2222 > locke.48273: Flags [.], ack 8506, win 517, options [nop,nop,TS val 1691306819 ecr 14360350], length 0
17:28:11.992123 IP 137.116.73.38.2222 > locke.48273: Flags [F.], seq 8288, ack 8506, win 517, options [nop,nop,TS val 1691306819 ecr 14360350], length 0
17:28:11.992232 IP locke.48273 > 137.116.73.38.2222: Flags [.], ack 8289, win 483, options [nop,nop,TS val 14360360 ecr 1691306819], length 0

In retrospect, it's hard to make an apples-to-apples connection, since SSH involves both an extra-cluster and intra-cluster connection working in tandem.

Actions #7

Updated by Brett Smith over 8 years ago

Possible explanations/fixes:

  • Send a noop packet regularly to make the connection look even more alive, or detect dead connections quicker.
  • The clouds that are misbehaving also run Ubuntu instead of Debian. Maybe there's a difference in Puma performance or connection limits on Ubuntu. Maybe check server logs for issues there?
Actions #8

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed
Actions

Also available in: Atom PDF