Project

General

Profile

Actions

Idea #2756

closed

Improve Workbench "show pipeline instance" page

Added by Tom Clegg almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
-
Story points:
5.0

Description

This page needs to be good at
  • Monitoring progress while the pipeline is running
  • Troubleshooting a failed pipeline
  • Reviewing/investigating details of a finished pipeline
Features:
  • If jobs are running, show log messages in real time using websockets
  • If jobs are queued, show queue status
  • If a job was created before pipeline itself, note "re-used existing job queued/finished at time X"
  • Show more job details, probably rearrange columns
    • component name should be plain, not styled with label, not a link to the job
    • component specification should be in its own column (the stuff independent of whether a job has been chosen to satisfy this component): script, script_version, script_parameters, etc.
    • if a job has been chosen, job detail should be in its own column (only the parts that aren't equal to the component description by definition): git commit
    • output should be its own column. Displayed as uuid (abbreviated?) link, then [first few] output filenames/sizes
    • log should be its own column. Displayed as uuid (abbreviated?) link, then direct link to the one .txt file it always contains
  • Show job log messages, e.g.,
    • Hover/pop-up to show latest+live logs for a single job that is running
    • Persistent "fake terminal window" above/below(?) with merged latest+live log messages from all jobs in this pipeline that happen to be running
Infrastructure:
  • API server should have access to the last few lines of every recent/current job log (via redis) as well as live logs (also via redis) and old/finished jobs (via log collection stored in Keep). Might need some more API to make them accessible via websocket (redis) or workbench (keep)?

Subtasks 1 (0 open1 closed)

Task #2848: Please review again branch: 2756-eventbus-in-workbenchResolvedTom Clegg05/22/2014Actions
Actions #1

Updated by Tom Clegg almost 10 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg almost 10 years ago

  • Story points set to 5.0
Actions #3

Updated by Radhika Chippada almost 10 years ago

  • Assigned To set to Radhika Chippada
Actions #4

Updated by Radhika Chippada almost 10 years ago

Tom says:

16:42 tomclegg
I think we can make this lean toward "test suite that runs a pipeline using workbench via capybara" -- including doing the sorts of monitoring that a human would do. ("Is my job queued, or what? What does this intermediate output look like?")

tomclegg
...I'm thinking in order to test the "show pipeline" page you'll end up running a lot of pipelines, making example pipelines with jobs that consistently fail after 30 seconds, etc. So it might be a good opportunity to look at that test suite as a "crunch+keep diagnostic" feature.

Actions #5

Updated by Tom Clegg almost 10 years ago

Looking at f533984

I suspect sessionStorage is not what we want here. This lets us share the "#foo div should get updated with events about bar" map across all browser windows. I think what we want is simpler, and we can probably take advantage of the existing browser/jQuery event mechanism, using classes/attributes as selectors.

E.g., on every page:
  • On ready or ajax:complete, if there are any $('.event-listener') elements, start a websocket connection (unless one is already open). $(window).data('websocket') is probably a convenient place to put the websocket object, see http://api.jquery.com/data/
  • When websocket message arrives, look at the object_uuid and send the received data to all interested elements with something like
$('.arv-event-listener[data-object-uuid=FOO],.arv-event-listener[data-object-uuid=all]').trigger('arv-event',[event.data]);

Then, define a handler to respond to incoming messages:

$(document).on('arv-event', '.event-handler-append-logs', function(event, eventData){
  $(this).append(eventData.summary + " happened just now\n");
}

And put the div somewhere on the pipeline instance page:

<div class="arv-event-listener event-handler-append-logs" data-object-uuid="FOO"></div>

(Sharing websocket connection across browser windows might be a worthwhile optimization, but I'm inclined to set that aside until we have the simple version working well. It seems somewhat likely to open a can of worms...)

Actions #6

Updated by Tom Clegg almost 10 years ago

Tom Clegg wrote:

I suspect sessionStorage is not what we want here. This lets us share the "#foo div should get updated with events about bar" map across all browser windows. I think what we want is simpler, and we can probably take advantage of the existing browser/jQuery event mechanism, using classes/attributes as selectors.

My bad, I was thinking of localStorage here. SessionStorage persists across page views, but only within the same window. However, I still think we don't want either of them. The one thing worth persisting across page views is the websocket connection itself, but I'm not sure this is even possible (the Storage interfaces seem to be meant for storing "data" as opposed to live objects like websocket connections) and we surely don't want to get hung up on writing/debugging this optimization now. (jQuery's $(window).data('arv-websocket') takes care of sharing one websocket everywhere in the window, which is the important thing, right?)

Actions #7

Updated by Tom Clegg almost 10 years ago

Notes @ 5cf70c36

Suggest a simplification here:

-    event_log_disp.onopen = function(event) { onEventLogDispatcherOpen(event) };
-    event_log_disp.onmessage = function(event) { onEventLogDispatcherMessage(event) };
+    event_log_disp.onopen = onEventLogDispatcherOpen;
+    event_log_disp.onmessage = onEventLogDispatcherMessage;

I think the event_log_listener_map is superfluous. For example, if N elements are listening for events about the same UUID, the following loop sends the event to all elements N times each. I think it would work better if reduced to:

-  for (var key in event_log_listener_map) {
-    value = event_log_listener_map[key];
-    if (event_uuid === value) {
-      matches = ".arv-log-event-listener[data-object-uuid=\"" + value + "\"]";
-      $(matches).trigger('arv-log-event', event.data);
-    }
-  }
+  matches = ".arv-log-event-listener[data-object-uuid=\"" + event_uuid + "\"]";
+  $(matches).trigger('arv-log-event', event.data);

("Who is listening to what" is determined by the current state of the DOM when we use the $(matches).trigger() form -- I think we'll make our lives much easier by relying on that alone, rather than creating a separate dispatch map and trying to keep it synchronized with the DOM.)

Rather than having a global event_log_disp, I think you can do this:

 function onEventLogDispatcherOpen(event) {
-  event_log_disp.send('{"method":"subscribe"}');
+  this.send('{"method":"subscribe"}');
 }
I mostly like your approach to getting the websocket endpoint available to javascript, but I have a few suggestions:
  • Put it in layouts/application.html.erb to ensure that every single page has it
  • Use a META tag instead. (Putting it in sessionStorage is either unnecessary (if we include that js on every page) or creates a strange pattern ("works differently depending on what has happened in previous page views") which could end up causing a subtle debugging exercise.)
  • Don't set the META tag (or whatever) if the user is not logged in. In JS, check for missing url, and abandon websocket behavior gracefully (probably just leave event_log_disp==null) without trying to connect. (This shouldn't matter much now since we don't listen for events on not-logged-in pages, but seems likely to be forgotten when it becomes relevant if we don't do it now.)
<meta name="x-arv-websocket-url" content="...">

the_url = $('meta[name="x-arv-websocket-url"]').attr('content');
Actions #8

Updated by Tom Clegg almost 10 years ago

  • Description updated (diff)
Actions #9

Updated by Tom Clegg almost 10 years ago

  • Description updated (diff)
Actions #10

Updated by Tom Clegg almost 10 years ago

Looking at 86a02fd

  • I think the new log buffering code in crunch-dispatch will behave badly when more than one job runs at a time.
  • Instead of using globals, use the existing j[:stderr_buf] with just a few changes:
    • When checking if j[:stderr_buf].index "\n" also check j[:stderr_flushed_at] != Time.now.to_i (flushing 1x per second regardless of size should be fine)
    • In the lines.each do |line| block, instead of making pub_msg and doing the redis stuff, store the line(s) in Log
  • rescue exceptions when saving Log (but still update :stderr_flushed_at, and replace entire log buffer with "[Failed to write logs]\n" in case Log starts working in the future). It would be a shame for crunch-dispatch to crash out and abandon all running jobs over this, but we also don't want it to go OOM.
I'd suggest
  • event_type: stderr
  • summary: nil (I don't think the postgres "string" type is long enough for the log buffer itself. Is it possible for us to summarize this stuff?)
  • properties: {"text" => j[:stderr_buf]}
Actions #11

Updated by Tom Clegg almost 10 years ago

at 8e5db22

It looks to me like j[:stderr_buf] ends up with two copies of everything by the time it gets to write_log, and there's confusion between stderr_buf (initially this was "data read just now from stderr handle") and j[:stderr_buf] (initially this was "data that has not yet been propagated to stderr and redis"). I think this

        j[:stderr_buf] << stderr_buf
        if stderr_buf.index "\n" || j[:stderr_flushed_at] != Time.now.to_i
        lines = stderr_buf.lines("\n").to_a
          lines.each do |line|
            $stderr.print "#{job_uuid} ! " unless line.index(job_uuid)
            $stderr.puts line
            log_msg = "#{Time.now.ctime.to_s} #{line.strip}" 
            j[:stderr_buf] << (log_msg + " \n")
          end

          if (LOG_BUFFER_SIZE < j[:stderr_buf].size) || ((j[:stderr_flushed_at]+1) < Time.now.to_i)
            write_log j
            j[:stderr_flushed_at] = Time.now.to_i
          end

should be more like

        j[:stderr_buf] << stderr_buf
        if j[:stderr_buf].index("\n") && j[:stderr_flushed_at] != Time.now.to_i
          lines = j[:stderr_buf].lines("\n").to_a
          lines.each do |line|
            $stderr.print "#{job_uuid} ! " unless line.index(job_uuid)
            $stderr.puts line
          end
          write_log j
          j[:stderr_flushed_at] = Time.now.to_i
  • Especially note operator precedence trap: I think if method arg && a != b is evaluated as if method (arg && a != b)...
  • The code used to prevent logging partial lines (e.g., if crunch-dispatch reads "Fo", then 5 seconds later reads "o\n", it should output a single Log "Foo\n"). Does the new code still accomplish this? (I don't see it, but I might be missing something.)
  • \n doesn't do that in single-quoted strings; on L405 I think you want "Failed to write logs\n"
  • Some extra whitespace - git diff --check master...origin/2756-eventbus-in-workbench
Actions #12

Updated by Radhika Chippada almost 10 years ago

  • Assigned To changed from Radhika Chippada to Tom Clegg

Tom, yes, there were two copies of each line, which I noticed later as I was looking at the log table entries. It is fixed now. I do not think we have the "Fo, then 5 seconds later reads o\n" issue since I am doing line splits on new line char.

I also changed "if j[:stderr_buf].index("\n") && j[:stderr_flushed_at] != Time.now.to_i" to only look for "if j[:stderr_buf].index("\n")". I think it is not required to check the flushed_at time here since I am checking it later.

Actions #13

Updated by Radhika Chippada almost 10 years ago

  • Assigned To changed from Tom Clegg to Radhika Chippada
Actions #14

Updated by Tom Clegg almost 10 years ago

Noticed typo in helper method: pipieline_log_history

  • apps/workbench/app/helpers/pipeline_instances_helper.rb
  • apps/workbench/app/views/pipeline_instances/_show_components.html.erb
Actions #15

Updated by Radhika Chippada almost 10 years ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF